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VLSI  FLOATING  POINT  CHIP  DESIGN  STUDY 


1.0  In '•  roduc t  i  on  and  Summary 

This  report  summarizes  the  results  of  a  study  undertaken  by 
the  Hughes  Research  Laboratories  to  investigate  techniques  for 
VLSI  implementations  of  arithmetic  algorithms.  Wh i le  most 
attention  in  the  VLSI  era  has  gone  to  activities  related  to  the 
shrinking  of  design  rules  and  the  introduction  of  sophisticated 
architectures,  there  is  still  a  large  contribution  that  is 
aval  1 ab I e  from  the  use  of  novel  algorithms  for  performing  the 
arithmetic  computations  that  underlie  a  I  I  computations.  This  is 
evipent  from  the  variety  of  new  arithmetic  elements  seen  In  the 
newest  generation  of  microprocessors,  digital  signal  processing 
chips,  special  function  ari thmet ic  chips  (e.g.,  multiplier, 
div  der) ,  and  co-processor  chips. 

Witnin  the  domain  of  signal  processing  there  are  two 
ar,tnmetic  operations  of  particular  importance,  division  and 
sduai'e  '"oot  .  These  functions  are  associated  with  such  algorithms 
as  Singular  value  decompositions  (SVD) ,  Givens  rotations,  and 
varo^s  other  orthogonal  transformations.  In  this  report  we 
investigate  techniques  for  performing  division  and  square  root 
tnat  are  suitable  for  VLSI  implementation. 

I ri  Section  2  we  describe  an  algorithm  for  performing  area- 
time  efficient  division  based  on  a  serial/paral lei  organization. 
We  feel  that  this  circuit  considerably  advances  the  state-of-art 
because  our  study  shows  that  we  can  compute  at  rates  as  fast  as 
SDec.ai  hardwired  parallel  circuits  that  use  an  order  of 
magnitude  more  area.  These  conclusions  are  based  on  the 
com pa  rison  of  division  speed  capabilities  shown  in  Table  1  for 
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Table  1 .  Comp a  risen  of  Division  Capabilities 


i  1 

!  CHIP  (SPEED  (Msec) 

i  1 

PRECISION  1  1 

(BITS)  1  COMMENTS  ( 

1  1 

!  1 

lintel  8087  |  39 

1 (Slave  Processor) | 

1  1 

[  2 ' 

64  (Fit.  Pt . )  1 280  X  280  mi \^\ 

1  (nMOS)  1 

1  1 

1  1 

|NS  16081  1  8.9 

((Slave  Processor)! 

1  1 

I  1 

32  (Fi X  Pt.)  1  nMOS  | 

1  1 

1  1 

1  1 

(HP  11.2-2.4 

((single  chip  ( 

1  1 

1  1 
64  (Fit.  Pt.)  1  CMOS/SOS  „| 

(210  X  290  mi  ni 

1  .  1 

1  1 

I  We  1 tek  (  5  -  10 

!  (2  chip  set)  1 

i  1 

24  (Fit.  Pt.)  (300  X  290  Mu  1 t | 

(305  X  225  ALU  | 

1  .  .  1 

1  ( 

;  99000  1  4.9 

Im  processor  ( 

1  1 

16  (Fi X  Pt.)  1  1 

1  1 

1  1 

1  HUGHES  1  0.88 

1 _ 1 _ 

1  2  ' 
32  (Fix  Pt.)  (3m  50x150  mi^l 

_ ^ _ L 

monolithic  processor  chips  Here,  our  calculations  assumed  an 
NMOS  3-micron  technology,  which  is  typical  of  that  used  in 
D''esent  day  commercial  semiconductor  products.  Unfortunately, 
timers  are  not  a  lot  of  special  purpose  divider  chips  on  the 
marKet  with  which  to  make  comparisons.  Considering  that  our 
circuit  consumes  a  minimal  amount  of  area  because  of  its  serial - 
parallel  organization  (shift-and-subtract),  it  is  clearly  much 
more  area-time  efficient  than  any  of  the  other  circuits  For 
e'amp'e,  the  Hewlett  Packard  circuit  consumes  an  entire  chip 
about  35,000  transistors)  yet  is  half  as  fast  We  have  been 
ass'Sted  in  uhis  activity  by  Prof.  Milos  E  rc ego vac  of  UCLA ,  who 
performed  the  various  algorithmic  investigations. 
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In  Section  3  we  describe  "on-line"  techniques  for  bit-serial 
calculations.  The  advantage  of  this  approach  is  that  parallelism 
is  achieved  by  overlapping  arithmetic  operations  at  the  digit 
level  and  the  associated  feature  that  communication  lines  are 
needed  for  only  one  digit  for  each  operand  mantissa.  Our 
Investigation  of  the  on-l i ne  approach  was  based  on  the  numerical 
requirements  imposed  by  a  rea  st i c  algorithm,  singular  value 
decomposition  (SVD)  The  ts  c  this  study,  carried  out  by 

Paul  Tu  of  UCLA ,  indicate  that  me.  do  la'-,  efficient  on-line 
approaches  to  complex  algor  tnms  are  promising.  However,  more 
research  work  is  necessary  be'^ore  aetai  led  implementations  are 
possible.  In  particular  it  w.  I  I  be  necessary  to  determine  how  to 
deal  with  variable  delays  introduced  by  certain  floating  point 
operations,  such  as  addition  and  subtraction.  It  was  also 
concluded  that  fixed  point  operations  would  be  difficult  for 
comp lex  on-line  calculations  because  of  the  requl rement  that  all 
operands  fal I  within  a  certain  range,  which  would  be  difficult  to 
monit-or  during  the  course  of  the  calculation. 

Finally,  in  Section  4  we  describe  a  variety  of  i te ra t i ve 

2  2 

algc'ithms  for  calculating  the  square  root  of  a  number  a  +b  , 
which  is  a  very  important  calculation  due  to  the  need  for 
obtaining  the  sines  and  cosines  used  in  Givens’  rotations.  This 
approach,  using  suitable  approximations  for  the  first  estimate, 
can  in  fact  be  quite  fast.  We  have  looked  at  several  numerical 
square  root  algorithms,  all  of  which  use  some  form  of  polynomial 
app rox  ■ ma 1 1  on  to  the  first  result.  It  was  important  to  use 
techniQues  that  avoided  loss  of  precision  in  the  squaring  of  a 
and  b,  yet  at  the  same  time  did  not  invoke  the  use  of  conditional 
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branching.  We  found  that  by  using  a  4th  order  polynomial 
approximation,  only  one  iteration  was  needed  to  provide 
sufficient  accuracy  for  most  calculations.  The  total  time  to 
obtain  a  sin  or  cos  function  this  way  was  equivalent  to 
approximately  13  multiplies. 
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2.0  Division 


21  Introduction 

As  mentioned  earl ler,  division  is  a  very  important  operation 
in  signal  processing  because  it  is  found  in  so  many  of  the 
vai'  ous  matrix  factorization  techniques,  as  for  example  Gaussian 
e’  imination.  Even  though  only  a  sma I  I  fraction  of  the  total 
operatic  '  associated  with  any  given  algorithm  might  involve 
dtvsior,  it  still  must  be  given  considerable  attention  becausi 
the  division  (or  square  root)  operations  can  "bottleneck"  an 
ent  re  computation  when  concurrent  architectures  are  in  use.  An 
example  of  this  is  the  systolic  array  which  t r 1 angu I  a r i zes  a 
m.at'  X  .  In  such  an  array  only  a  single  border  cel  I  or  column  of 
bcraer  cel  Is  might  be  computing  divisions;  however,  since  al  I 
processing  elements  (PEs)  are  operating  in  lock  step,  the  slowest 
PE  will  determine  the  overal I  cycle  time.  What  is  needed  are 
division  and  square  root  operations  to  proceed  at  the 


m  u 

it'Piication  rate,  wh 

i  ch 

IS  general 

1  y  the  rate  1 

1  m  1  t  i  n  g 

factor 

i  n 

ail  the  othe  r  cells 

If 

th 1 s  we  re 

the  case  the 

system 

wou  1  d 

be 

m.a  X  I  m a  1  1  y  e  f  i  c  i  e  n  t . 

Tre  basic  problem  with  division  is  that  it  is  not  possible  to 
pipei'e  it  in  the  same  way  as  can  be  done  with  multiplication. 
This  1  s  due  to  the  i  nab  i  I  i  ty  to  l-now  what  the  quot  i  ent  digits  a  re 
ahead  of  t i me ,  W i th  mu  I t i p I  i cat i on  the  mu  1 t i p 1  i e  r  b i ts  are  k  nown 
anead  of  time  so  that  they  can  be  processed  at  any  convenient 
time  As  a  result,  our  approach  will  require  that  we  simplify 
tre  Quotient  selection  process  such  that  it  is  sufficiently  fast 
trat  pipelining  would  not  speed  't  up  even  if  it  were  available. 
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Our  divider  circuit  has  been  designed  in  such  a  way  that  it 
f  t  s  i nto  the  overall  f  ramework  provided  by  our  Multiplication 
□rented  Processor  (MOP)  chip^  which  already  contains  a  fast 
mu'tipl  I  cation  circuit.  Our  chip  design  approach  is 
characterized  by  a  number  of  ♦eatures.  First,  all  arithmetic 
c  '"cu'ts  are  serial/paraliel  S/P)  (e.g.,  shift-and-add  types) 
tnat  use  rad i x-4  arithmetic  and  have  their  own  set  of  dedicated, 
gn  speed  clocks.  The  S/P  organization  saves  a  large  amount  of 
space  comoared  to  a  fully  paral lei  design,  and  the  high  speed 
c  OCKS  and  radix-4  operation  are  intended  to  prevent  loss  in 
speed  compared  to  the  pure  paral lei  approach.  In  addition,  al I 
Co"  ar.cnmetic  algorithms  are  intended  to  be  based  on  some  form 
c*  carry-save  type  scheme  in  order  to  el iminate  carry  propagation 
ac"C)SS  the  fuM  precision  of  the  word.  Each  arithmetic  circuit 
"as  ts  own  set  of  dedicated  control  hardware,  so  that  all  the 
c"09"ammer  is  required  to  do  is  supply  the  arithmetic  unit  with 
tne  apbropriate  operands.  The  high  speed  clocks  are  synchronized 
w  or  respect  to  the  slower  system  clocks  which  are  responsible 
■^or  transferring  data  on  chip  and  between  chips  We  expect  there 
w:  pe  4  to  8  high  speed  clock  cycles  per  slower  system  cycle. 

Al  c'ocks  are  of  the  two-phase,  non-overlapping  variety. 

There  are  two  basic  approaches  to  performing  division  in  a 
conventional  way.  The  iterative  or  successive  approximation 
tecdrlQues  use  a  fast  mu  I t i p I  ler  to  achieve  quadratic  convergence 
"ates  Often  one  can  use  a  I ook-up  table  to  provide  a  good  f i rst 

1  J  G  Nash  and  K.  Petrozol in,  "VLSI  Implementation  of  a  Linear 
Systol ic  Array,"  presented  at  ICASSP ,  March  1985,  Tampa, 
Florida, 
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estimat-e.  For  every  iteration  after  that  the  precision  of  the 
result  doubles.  Wh i le  this  technique  is  widely  used,  it  does 
exact  a  large  price  in  terms  of  hardware.  For  this  reason  it  is 
usea  primari ly  in  appi i cations  where  large  precision  is  required 
and  cost  or  chip  area  is  not  the  most  important  criterion.  The 
othe’"  popular  approach  is  based  on  recursive  techniques  in  which 
prec  s  on  is  proportional  to  the  number  of  recursion  cycles.  We 

■*'ee  that  this  is  the  best  approach  for  us  because  It  lends 

tse  *  to  very  a  ""ea-t  i  me-ef  f  1  c  i  ent  VLSI  implementation  and 

because  we  feel  that  we  can  obtain  division  rates  comparable  to 
our  multiply  times,  which  is  not  the  case  for  the  iterative 
tecnr  aue.  For  concurrent  a^'chitectures  the  issue  of  integration 
s  a  very  important  one  because  large  numbers  of  PEs  are  required 
to  ac-'eve  high  throughputs.  If  each  PE  were  excessively 
comp  ex,  then  the  overal  I  system  could  become  unwieldy. 

In  addition,  our  relatively  si mp le  S/P  bit-slice  arithmetic 
un  ts  pnovide  an  important  degree  of  system  modularity  in  that  It 
5  a  very  s  mp I e  tash  to  configure  a  PE  with  a  variety  of  options 
nso‘a^  as  arithmetic  units  and  memory  are  concerned.  This  Is 
•"pontant  Decause  we  feel  that  It  is  unlike',  that  a  single  chip 
type  w  ■'  sat'S’^y  a  variety  of  system  requirements.  In  other 
words  each  cor.current  system  implementation  might  be  Du  i  It  from 
the  same  bas i c  c  n i p  modu I es  (mu  I t i p I  i e  rs ,  d i  v i de  rs ,  adde  rs  , 
registers,  etc,,  but  trie  actual  PEs  wou  Id  be  different 
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Ae  rave  oonea  at  trree  ai'^*e'"ent  recurs've  a  vision 

9  -  3 

a  gorithms,  the  radix-2^  and  '"aaix-4  bRT,  and  tne  prediction 

4  5  6 

technique  of  Ercegovac .  The  SRT  approaches  are  the  Pas ' c 

no^-restoring  shift-and-subtract  techniques.  Tab'e  2  presents  a 
cooDar  son  these  approaches  in  terms  of  important  parameters 

and  figu'^es  of  merit.  Here  the  S/P  multiplier  is  used  as  a  basis 
^'or  comparison.  As  can  be  seen,  the  -adix-S  SRT  approach  is  the 
S'mpest,  but  is  the  slowest  On  the  other  hand  the  prediction 


c  ‘omotrison  d  '  V  1  s  i  on  ■  I  gon (n  =  number  of  bits)  in  term#  of  importmt  VLSI 
pa '^•'T>eter s  CPA  refers  to  i  precision  32-b't  add  times  to  propagate  carry 

Pe'at  ve  area  estimates  are  approirimate  at  best 
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technique  is  the  most  complex,  but  has  the  shortest  division 
time.  Al I  have  approximately  the  same  area-time  product.  Since 
a  basic  requirement  is  that  division  and  multiplication  times  be 
as  balanced  as  possible,  we  tend  to  weight  the  absolute  division 
time  more  highly  than  area-time  product.  We  have  assumed  for 
purposes  of  comparison  that  the  carry  propagate  adder  (CPA)  time 
is  125  nSec  (8  MHz)  and  gate  delays  are  6  nsec. 
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22  Recursive  Division  Algorithm  With  Prediction 


2.2.1  Introduction 

In  this  section  we  describe  the  division  algorithm  and  its 
implementation  in  terms  of  NMOS  circuitry.  More  detailed 
description  Is  provided  in  Appendices  A  -  C  containing  reports  by 
Prof.  Milos  Ercegovac . 

The  essential  requirement  necessary  for  a  fast  recursive 
divider  is  a  fast  recursion  time.  This  speed  is  usually  degraded 
by  the  time  required  to  perform  the  selection  of  the  quotient 
digit  for  a  particular  recursion  For  radix-2  operation  this  is 
not  too  difficult,  but  for  radix-4  approaches  this  becomes  more 
comp iex,  resulting  in  long  I oop  times.  Typically,  there  are 
three  basic  steps  in  each  recursion, 


R[l+1]  =  r(R[l]  -  q.X) 


(1) 


where  X  is  the  divisor,  R[i]  is  the  partial  remainder  of  the  I 
step,  R  [0] =Y  IS  the  dividend,  q.  is  a  digit  of  the  quotient,  and 
r  s  tne  radix.  Here  the  quotient  digits  must  satisfy 
p  £  Q.  _<  p  ,  p  being  digit  set  maximum.  These  steps  are  listed 
be i ow ■ 

Fo  rm  q . X 
1 

Subtract  and  shift  to  obtain  R[i+1] 


1 

2 

3 


Use  quotient  digit  selection  process  to  obtain  q^^l  from 


R[,^l] 

The  prediction  algorithm  uses  two  techniques  to  speed  up  the 
set  of  operations  described  above  First,  a  new  quotient  digit 
selection  procedure  is  introduced  that  requires  only  truncation 
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O'"  rounding  of  a  limited  precision  partial  remainder  to  obta  i  n 
q..  Second,  the  quotient  digit  Is  obtained  by  a vo 1 d i ng  exp  I  1 c 1 1 
evaluation  of  (1);  rather,  it  is  obtained  from  the  expression 


qj_^^  =  round  (r(R[l]  -  q.)) 


iwnere  R[i]  is  the  low  precision,  non- redundant  representation  of 

the  remainder.  Here  "round"  means  that  q.  ,  is  selected  from  a 

1+1 

rounded  and  truncated  version  of  the  result.  Since  q.  is  simply 
an  integer,  the  subtraction  is  simple.  Consequently,  the  maximum 
step  time  for  the  recursion  loop  Is  given  by  the  largest  delay 
time  associated  with  either  the  quotient  selection  process  or  the 
carry-save  evaluation  of  the  new  partial  remainder.  In  other 
words  there  are  two  separate  operations  occurring  at  the  same 
time,  so  the  step  time  is 


T  =  max(t  +  t  +  t,, 
a  q  I 


t  +  t  +  t 
s  cs 


w  h  e  e 

t  =  time  for  assiml  1  at i on  of  R[i]  (carry  propagation, 
3 

~6  bits) 

t  =  time  to  subtract  and  round 

q 

t  =  time  to  select  divisor  multiple 
s 

t  =  time  for  carry-save  subtraction 
t|  =  time  to  load  registers. 

In  order  to  perform  the  algorithm  In  this  way.  It  must  be 
d ■  V  I dea  in  to  two  parts,  range  transformation  and  recursion. 
F'-st,  for  the  simple  quotient  digit  selection  process  to  work, 
't  s  necessary  that  the  dividend  X  be  transformed  into  a  range 
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(1  -  a)  <  X*  <  (1  +  a) 

where  a  Is  a  number  on  the  order  of  2  ^  to  2  ^  and  X*  is  the 
transformed  divisor.  In  the  remainder  of  this  section  we 
describe  the  two  basic  parts  of  the  overal I  algorithm,  with 
particular  emphasis  on  circuit  implementation.  More  detailed 
discussion  is  provided  in  Appendix  A-C . 

2.2.2  Range  Transformation 

We  have  looked  at  several  approaches  to  performing  range 
fansf  ormat  I  on  (Appendix  A-C)  .  In  this  section  we  discuss  the 
ore  that  appears  most  promising,  that  based  on  reciprocal 
approximation  by  power  series. 

The  problem  here  Is  to  compute  a  transformed  division  X*  that 
satisfies  the  relation  |X*  -  1  |  _<  a  .  The  reciprocal 
approximation  approach  finds  a  multiplier  M  such  that  X*  =  XM ,  so 
that  this  relation  .5  satisfied.  To  do  this  we  divide  the 
a'v'sor  X  into  two  parts,  X^  and  and  set 

d  =  X^  .  2'*^  X^ 


w  n  €  ^  ^ 


"1  = 

(1 

II 

(N 

X 

\.2 

.  .  X^). 

Then  a  simple  series 

approx  1  mat  1  on 

g ' 

1  ves 

R  r 

1/D 

=  Rj  -  2- 

k 

(2) 
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where 


is  1/Xj^  truncated  to  u  bits 
^2  2 

is  (Rj)  truncated  to  s  bits 

^2  ^2  ^  bits 

e^.  is  the  truncation  error. 

Note  that  our  definition  of  assumes  that  it  is  normal ized 
so  that  the  most  significant  bit  is  a  "1."  Since  we  are  using 
fixed  point  arithmetic,  this  implies  that  a  shifter  network  will 
be  required  as  a  front  end  to  the  entire  divider  to  perform  this 
operation.  This  network  will  not  be  described  here  since  detai Is 
are  already  given  in  a  previous  report 

The  values  of  R^  and  R^  can  be  obtained  from  a  PLA  with  X^  as 
ar,  r.Dut  In  Table  3,  values  of  R^  and  R^^  are  given  for  the 
chcce  of  truncation  parameters,  k=5,  t=9 ,  u=3,  s=3,  and  ix=3 . 

Once  these  have  been  obtained  Equation  (2)  can  be  evaluated. 

Then  we  can  perform  the  mu  I 1 1 p I  ication  XR  to  yield  X*,  since  R  is 

an  apD '"op  r  I  a  te  t  y  accurate  representation  of  1 /X  .  This 

mu  I t i p I  cation  wl  I  I  be  performed  us  ng  the  same  carry-save  adder 


that  is  reou 1  red  in  the  second,  recursion  step 


7,  J  (j  Nash  and  G.  R,  Nudd,  "Des  ■  qn  Study  of  Floating  Point 
S/stoi 'C  VLSI  Chip,"  NOSC  Final  Report,  Contract 
No  N66001 -82-M-41 20 ,  September,  1983 
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e> 

Table  3,  Values  of  and  Rj  obtained  from  truncated  input  . 


xiXaTjr^rjrg 

^0^1^2*3*4^526*7^8 

i?2 

WqW^W2W2 

1.00000 

0.11111111 

0.111 

1.00001 

0.11111000 

0.111 

1.00010 

0.11110000 

0.111 

1.00011 

0.11101010 

0.110 

1-00100 

0.11100011 

0.110 

1.00101 

0.11011101 

0.101 

1.00110 

0.11010111 

0.101 

1.00111 

0.11010010 

0.101 

1.01000 

0.11001100 

0.100 

1.01001 

0.11000111 

0.100 

1.01010 

0.11000011 

0.100 

1.01011 

0.10111110 

0.100 

1.01100 

0.10111010 

0.011 

1.01101 

0.10110110 

0.011 

1.01110 

0.10110010 

0.011 

1.01111 

0.10101110 

0.011 

1.10000 

0.10101010 

0.011 

1.10001 

0.10100111 

0.011 

1.10010 

0.10100011 

0.011 

1-10011 

0.10100000 

0.011 

1,10100 

0.10011101 

0.010 

1.10101 

0.10011010 

0.010 

1.10110 

0.10010111 

0.010 

1.10111 

0.10010100 

0.010 

1.11000 

0.10010010 

0.010 

1.11001 

0.10001111 

0.010 

1-11010 

O.lOOOllOl 

0.010 

1.11011 

0.10001010 

0.010 

1.11100 

0.10001000 

0.010 

1-11101 

1  O.lOOOOllO 

0.010 

1.11110 

0.10000100 

0.010 

1.11111 

0.10000010 

0.010 
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Evaluation  of  equation  (2)  involves  a  3  x  3  multiplication 
(Rl  X2) ,  the  result  of  which  Is  shifted  by  four  digits  and  added 
to  .  It  appears  easier,  rather  than  to  complement  this  result 
and  add,  to  first  complement  R^  and  later  complement  the  final 
result.  With  this  scheme  we  can  begin  to  obtain  the  radlx-4 
multiplier  digits  of  R  immediately  since  there  is  no  addition 
being  performed  in  these  bit  positions,  as  shown  in  Figure  1. 
This  is  convenient  since  we  are  going  to  perform  the 
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Figure  1.  Functional  description  of  the  evaluation  required  in 
Equa 1 1  on  (2)  . 

multiplication  beginning  with  the  most  significant  digit  of  R. 
This  Is  very  important  because  we  can  then  begin  our  range 
transformation  algorithm  before  completion  of  the  evaluation 
of  (2).  Expressions  of  the  recoded  multiplier  bits  In  terms  of 
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f^e  variab'es  in  Figure  1  are  given  in  Appendix;  C.  These  involve 

s  i  inp  I  e  Circuits,  typ  i  ca  I  I  y  with  two-gate  delays. 

Since  we  wi  I  I  be  using  the  same  carry-save  adder  for  division 

recursion  as  for  the  multiplication  required  in  (2),  and  since  we 

w 1  1  be  pe rf orm i ng  the  mu  1 1 i p I  i cat ; on  most  significant  bit  f 1 rst , 

some  of  the  product  wi  I  I  be  shifted  off  the  left  end  of  the 

circuit  (divider  circuits  shift  left).  However ,  since  the  f i rst 

7  bits  of  X'*'  are  either  a  "0"  or  a  ”1,"  only  a  couple  of  bits 

will  be  necessary  to  determine  X""  . 

A  block  diagram  of  the  range  transformation  circuit  is  shown 

r  F  gure  2  The  PLA  begins  its  operation  after  a  signal  to  load 

divider  circuit,  "LD  DIV .  "  From  this  point  on  in  the  control 

section,  up  to  the  generation  of  the  ’’M"  values,  al  I  logic  Is 

combinatorial.  There  is  a  multiplexing  shifter  register  that 

r^rs  off  the  high  speed  clocks,  and  >  which  supplies  the 

'■ecoaed  multiplier  bits  to  the  buffer  driver  ( "BUF" )  controlling 

tne  selection  of  the  appropriate  value  of  X  (-2X,  -X,  0,  X,  2X) 

be  go  to  the  carry-save  adder.  After  the  "LD  DIV"  signal  has 

gone  i ow ,  this  multiplexer  starts  with  M  ^  =  0  as  an  input.  A 

c  I  ’'Cu  I  t  for  determination  of  ^^^2  given  in  Appendix  C, 

Figure  2a,  which  is  bu 1  It  from  simple  ful  I  adders.  The  PLA  is  a 

straightforward  5-lnput,  10-output  structure,  with  an  estimated  5 

gate  delays  associated  with  it.  We  have  performed  a  first  order 

analysis  of  it  using  a  standard  AND-OR  plane  approach  Using  an 

2 

est  mate  of  approximately  66  X  microns  per  cel  I  ,  where  X  is  one- 

n  a  '  the  feature  size  in  microns,  we  estimate  that  the  entire 

2 

sti'ucture  would  consume  an  11  x  11  mi  I  area.  Since  the  input  to 
ti'e  PLA  comes  from  high  capacitance  bus  I  ines,  there  would  be  no 
need  for  large  buffer  drivers  as  input  to  the  PLA 
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Figu'-e  2.  Block  diagram  of  range  transformation  circuit. 

In  Table  4  we  estimate  the  number  of  gate  delays  associated 
with  the  generation  of  M..  Assuming  a  32 -MHz  clock,  all  values 
of  n  should  be  available  when  required. 
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Table  4  Estimate  of  the  number  of  gate  delays  from 

t-ransf  ormat  i  on  initiation  unti  I  generation  of  M.  Here 
we  assume  a  32-MHz  clock  and  6-nsec  gate  delays 
(3^  NMOS) . 


1 
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ab  i 

i  ty  of 

1 

1 

1 

1  After 

Sta  r  t 

1 E 1 apsed 

1 

1 

1 

IT 

\  me 

1  Time 

1 

Clock  Cycle 

1  Ope  rat i on 

1  In  Gate  De 1  ays |  ( 
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1 

1 
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1 
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1 
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1 

81 

1  156 

IMult  Mif 

6 
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1  11 

J _ 

1/2 

1 

_ L_ 

69 

1  187 

1 

IMult  MT 

4 

After  generation  of  X  ,  which  is  now  in  carry-save  form,  it 


s  necessary  to  send  it  to  the  CPA,  where  it  will  be  transformed 
into  non- redundant  form  for  use  in  the  recursion  step  next. 


223  Division  Recursion 

The  basic  idea  in  speeding  up  the  recursion  operation  is 
to  wSe  a  '"edundant  number  representation  for  the  quotient  digits 
so  that  a  quotient  digit  can  be  selected  at  each  step  using  a 
m  '  ted  p'"ecision  estimate  of  the  partial  remainder.  This 
mp ' I es  that  it  is  not  necessary  to  perform  a  full  precision 
subt'^action  at  each  recursion  step,  thus  avoiding  the  time 
consuming  c a r ry -p ropaga t i on  across  the  entire  word.  Instead  a 
cai'ry-save  approach  can  be  used  along  with  a  small  CPA  circuit  in 
the  most  s i gn i f i cant  b i ts  ( typ i ca I  I y  6)  to  dete  rm i ne  the  I  i m i ted 
prec  sion  estimat,e  of  the  partial  remainder.  One  can  use  this 
app'^oach  and  obtain  g  ,  from 

I  +  1 

q^_^^  =  round  ( r  (R  [  i  ]  -  q^X’*))  (3) 
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wnere  R[i]  is  a  I  imited  precision  est  i  mat-e  of  the  partial 
remainder.  As  mentioned  in  the  introduction,  using  this  approach 
I s  comp iicated  in  that  it  requires  the  evaluation  of  q^X  One 
can  sacrifice  additional  precision  and  compute  from  the 

exp ress I  on 


di^l  =  round  (r(R[i]  -  q.))  (4) 

Wh  I  le  '4  I  is  easier  to  evaluate,  It  requires  more  complexity  in 
the  '"ange  transformation  due  to  a  decreased  value  of  a. 

Tne  value  of  a  Is  determined  by  the  degree  of  precision  in 
these  estimates.  There  are  various  tradeoffs  here  associated 
tne  cho'ce  of  precision  in  R[i],  which  are  explained  more 
u  /  Appendix  B.  These  tradeoffs  can  be  represented  by  the 
exP'^ess  on 


a  <  7--U-V  (1  -  (r-l)  |) 

-  r(r+l;  p' 


rt'^e^e  0  s  a  measure  of  precision  of  R[i]  and  q-  in  preaicting 


|r(R[']  -  q,:/  - 

n  ^  5  ^eOuCing  the  precision  of  R  (corresponding  to  Simp'er, 

■•^aste^  c  ^cuitryj  implies  an  increase  in  p  or  a  decrease  in  a. 

Tn. e  sma  er  va'ue  of  a  means  additional  precision  required  in  the 

^a'ge  transformation  process  described  in  Section  2.2,  Typical ly 

-8  -9 

f- e  /a  ue  a  wo  j  I  d  be  2  to  2 


Ae  !•  a  V e  loohed  at 
za  '  z  ^  a t  o ri s  based  o r^ 


a  variety  of  oasic  cirruits  for 

(4)  Here  we  describe  what  we  consider 


the 
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most,  efficient  implementation  of  these.  The  basic  scheme  is 
I i lustrated  in  Figure  3,  which  shows  the  flow  of  information 
through  the  divider  and  Figure  4,  which  provides  a  more  circuit 
oriented  representation.  In  Figure  3  it  can  be  seen  that  there 
are  two  para  I  lei  paths  of  computation  producing  the  two  results 
and  R[i-*-l]  •  The  set  of  computations  to  produce 

1.  Propagate  (assimilate)  carry  to  produce  R[i] 

2.  Subtract  q.  and  shift  to  produce  r (R [ i ]  -  q.) 

3.  Round  result  to  generate 

The  operations  in  the  other  path  are  as  fol lows: 

1  Select  appropriate  value  of  X  (q^X  )  as  input  to 
CSA 

2.  Perform  carry-save  subtraction  to  yield  R[i+1] ■ 

A  c  cc"'  diagram  of  the  operations  required  for  the  path  are 

I  '  i^sfated  in  Figure  5.  The  CPA  addition  can  be  carried  out 
using  a  "relay-type"  adder  as  shown  in  Figures  6  and  7.  This 
tyoe  adder  Is  relatively  simple,  yet  carry  propagation  is  fast 


since  carry 

1  n •'^o rma  t  i  on 

'  s 

propagated  by  transmiss 

1  on 

gates  . 

TDeta i 1 s  on 

this  adder 

are 

p rov 1 ded  i n  Ref . 

[7]  .) 

The 

carry- i n 

input  to  th 

1  s  CPA  requ 1 

res 

a  sma 1 •  amount  of 

1  og  i  c 

as 

desc  r i bed 

In  A  p  D  e  n  d 1  / 

B, 
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gu  ■'e  3 


Funct/ional  block  diagram  o 


recursion  operation  data 


and  the  remainder  of  the  P  values  are  simply  obtained  by  shifting 
the  R [ i ]  or 


(P_l.  Pq'  ^2'  ^3^ 


The  quotient  digit  is  obtained  by  rounding  P  if  it  is  smaller 
than  2  and  by  the  integer  part  of  P  If  It  is  equal  to  or  larger 
than  2.  This  corresponds  to  the  following  expressions  for  q 
as 


i  +1 


Q_^[i+i]  =  p_2R'  ^  RjR^  ^  RjRa  +  PiR2^3 


-2  1  T  2 

Qo['^l]  =  (P_2  ^  Rj)  (P_2^Pl)  (P2^P3)  (^2  ^  ^3^ 

wrene  Indicates  complement.  These  expressions  can  be  mapped 

'  r.  to  the  circuits  shown  in  Figure  8a-8c  ,  where  we  have  rewritten 
some  cf  the  switching  expressions  for  more  efficient  circuit 
me ‘ ementat I  on  As  can  be  seen,  the  subtraction  and  rounding  can 
ce  ce'‘'*^ormed  in  about  two  gate  delays 
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The  second  parallel  compu taT ion  path  involves  the  selection 
of  the  appropriate  va I ue  of  X  (e . g . ,  q.X  )  to  add  into  the  carry 
save  adder  in  the  bit  slice  array.  This  involves  the  use  of  a 
decoder  that  takes  as  input  and  then  drives  one  of  5  I ines, 
which  are  running  across  the  bit-slice  array.  These  lines  are 
connected  to  multiplexers  in  each  bit  slice  which  select  -2X,  -X , 
0,  X,  or  2X  as  shown  in  Figure  9.  The  ful  I  adder  cel  I  shown  in 
Figure  9  can  be  bu i It  as  shown  in  Figure  10,  a  circuit  presently 
used  in  our  S/P  multiplier.  We  estimate  that  the  total  gate 
delay  through  this  second  computational  path  to  be  about  6  gate 
delays,  4  for  the  q . X*  operation  and  2  for  the  ful I  adder 
ope  ration. 
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Figure  9,  Divider  bit-slice  cell  containing  X*  register, 

multiplexer,  full  adder  and  carry,  sum  register.  Here 
clock  superscripts  f  and  s  refer  to  fast  and  slow 
clocks 
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Figure  10.  Tentative  design  of  carry-save  adder  for  use  in 

divider  bit-slice.  Inputs  to  this  cell  are  sum  bit, 
S.,  carry  bit  Cj,  and  divi so  r  bit  X .  . 

In  Figure  9  we  see  that  the  value  of  X*  is  taken  from  Bus  B 
ar.d  oaded  into  a  latch.  This  result  comes  from  the  CPA  after 
tne  rsnge  transformation.  The  necessity  of  performing  a  CPA  to 
produce  a  non- redundant  X*  is  an  important  consideration  because 
tn , s  :s  a  relatively  slow  operation.  That  is,  after  range 
t,  ra  ns  f  o  rma  t  i  on  ,  the  carry-save  value  of  X*  must  be  transferred  to 
t-ne  CPA  by  the  slower  system  clocks,  fol  lowed  by  another  transfer 
pack  to  the  divder  after  the  actual  CPA  is  required.  As  shown 
r  Tabic  5,  which  tabulates  the  overall  division  speed  in  terms 
the  number  of  high  speed  clock  cycles,  assuming  the  high  speed 
clocks  are  four  times  faster  than  the  slow  speed  clocks,  this 
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round  trip  operation  to  get  a  non- redundant  X  is  considerably 
time  consuming.  For  this  reason  it  would  probably  be  more 
reasonable  to  Include  a  dedicated  CPA  internal  bo  the  divider  to 
avo‘d  this  delay.  Although  we  have  not  achieved  our  goal  of 
equal i z i ng  multiply  and  division  rates,  Table  5  indicates  that 
the  rates  are  comparable.  Some  time  is  saved  in  the  division 
operation  because  the  quotient  digits  are  converted  back  to  non- 
redundant  form  as  they  are  created,  so  that  a  final  CPA  is  not 
necessary  as  with  the  mu  1 1 i p I  ier. 


Tabie  5.  Division  and  multiplication  times  assuming  32  MHz  and 
8  MHz  arithmetic  and  system  clocks. 
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^ecurs'OR  loops.  We  also  ex-oec^  that  as  part  of  this  operation 
vire  can  perform  the  necessary  corrections  associated  with  the 
initial  normalization  of  X  required  before  beginning  the  range 
transformation  operation 

Based  on  the  previous  discussions  we  estimate  that  the  total 
number  of  gate  delays  per  recursion  wi 1 1  be  approximately  six. 
Assuming  a  3— micron  NMOS  technology,  corresponding  to 
approximately  5  nsec  per  gate  delay,  the  clock  speed  of  the 
arithmetic  units  would  be  approximately  32  MHz.  The 
corresponding  slow  or  system  clocks  would  then  operate  at  8  MHz 
Using  these  values  we  estimate  the  divider  speed  for  32-bit 
*ixed-point  operation  would  be  880  nsec .  We  com pa  re  this  speed 
to  data  available  on  other,  commercially  available  circuits  in 
Tab’e  1.  For  the  purposes  of  comparison  we  note  that  the  Hewlett 
Packard  Circuit  is  a  dedicated  divider  chip  based  on  a  fully 
para-iel,  recursive  implementation.  As  can  be  seen,  our 
tentat  ve  design  compares  very  favorably  in  terms  speed  with 

ai  the  other  mu  I t i p I  ler  circuits.  However,  f rom  the  standpoint 
^  area-time  product,  which  is  more  appropriate  for  VLSI 
imp  ementation,  the  comparison  is  eve'  more  favorable 

Th'S  divider  design  is  very  efficient  in  its  usage  of  area 
because  the  range  transformation  and  recursion  steps  share  the 
same  carry-save  adder.  This  adder  circuit  wi  I  I  be  comparable  in 
Size  to  that  used  in  our  S/P  multiplier  because  the  range 
t  ^a  n  s  o  r.ma  t  '  OR  and  recursion  steps  sl'iare  the  same  carny-save 
adder  Con-ipa  red  to  ou  r  S/P  mu  itipiier,  the  divider  wiM  have 
“'tra  CircU'try  associated  with  the  front -end  normalizing  circuit 
and  a  register  to  store  the  quotient  digits  as  they  are 
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generated.  In  addition,  more  area  will  be  required  outside  the 


b i t-s I  Ice  array  for  control  operations.  We  estimate  that 

2 

approximately  150  x  50  mi  I  area  would  be  necessary  for  a 
3-ti  NMOS  design  of  this  circuit. 
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3  0  Imp i ement I ng  the  SVD  Computation  Using  On-Line  Arithmetic 


3  1  Introduction 

On-line  algorithms  perform  arithmetic  operations  in  a  digit- 
serial  fashion.  The  operands  of  a  computation  come  in  one  digit 
at  a  time,  with  the  most  significant  digit  first.  After  a  small 
number  of  input  digits  have  arrived,  the  most  significant  digit 
of  the  result  is  generated,  and  thereafter  one  more  digit  of  the 
"■esuit  s  genenated  in  each  time  step.  On-line  algorithms  for 
add'  »./'on  / Subtraction,  multiplication,  division  and  sguare  root 
cpe^at  ons  have  been  developed  and  simulated  by  Watanuki,® 

K  a  g  “  a  V  e  n  d  a  and  Ercegovac,  Tr  ivedi  and  Ercegovac  and 
Or-oDdca  and  Ercegovac . 

9"-  ne  arithmeti"  algorithms  have  two  very  attractive 
^eat'^'es  The  first  feature  is  that  parallelism  is  achieved  by 
Ove^  app.ng  arithmetic  operations  at  the  digit  level.  A  short 
de  ay  after  the  first  digits  of  the  operands  have  arrived,  the 


Osaari  Watanuki,  "Floating-point  on-line  arithmetic  -^or  highly 
d'git-serial  computation:  application  to  mesh 
c-ob  ems,"  Report  No  CSD810529,  Computer  Science  Dept  UCLA 
•/ay,  1981 


9 


Ragkiavendra 
ar  I  thrrieti  c  ,  " 


and  M.  D.  Ercegovac,  "A  simulator  for  on- 
Proceedings  of  5th  Symposium  on  Computer 


hmet'c,  May  1981,  pp  92-98 


^  T  r  I  V  e  d  I  a  n  d  M  D  Ere  ego  vac,  "On-line  alaorithms  for 

d  S  on  and  mu'tipi  cation,"  IEEE  Trans  on  Computers, 

Vo  C-26,  N'C  7,  Ju'y  1977,  pp  681-687 

V  Cl  Oklobdzija  and  M  D.  Erce qo vac,  "An  on-line  square  root 
algorithm,"  IEEE  T  rans .  on  Computers,  Vo  I  C-31,  No  1 
Jar  1982,  pp  70-75 
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trst  digit  of  the  result  becomes  available  for  the  foil ow 1 ng 
coerations.  Thus  the  computation  of  the  operations  which  use  as 
I  rout  the  result  of  previous  operations  does  not  have  to  wait  for 
the  previous  ones  to  finish.  The  second  feature  of  on-line 
ar  thmetic  is  that  since  data  is  transmitted  in  digit  serial 
fasn  on,  communications  I i nes  are  needed  for  only  one  digit  for 
each  operand  mantissa  and  the  interconnection  requirements  are 
greatly  redu  .ed .  This  feature  makes  on-line  arithmetic  very 
attractive  for  VLSI  implementation  where  interconnection 
i-eau  rement  is  of  great  concern. 

In  the  fol lowing  we  discuss  some  of  the  Issues  Involved  in 
a^jp  y'hg  floating-point  on— i  ine  algorithms  to  a  processor  array 
' me  emendation  of  the  SVD  algorithm.  The  example  we  use  here  is 
fe  iVD  algorithm  by  Luk  ,  which  Is  Implemented  using  a 
t'  an9.,iar  processor  array.  For  details  of  the  design  and  the 
a  gor  tnm,  please  refer  t-o  the  aoove  mentioned  reference. 


3  2  Or-^ine  implementation  of  the  SVD  algorithm 

he  algorithm  by  Luk  has  two  phases.  In  the  first  phase,  the 
mat'  /  s  transformed  into  upper  triangular  form.  In  the  second 
^  ase,  ..^he  up;  per  triangular  matrix  is  di  agon  a  I  ized  using  2  —sided 
Jacob  'otations.  The  algorithm  is  performed  on  a  triangular 
array  processors.  The  requ  1 --ed  computations  for  each  node  is 
summar  zed  as  lows.  Suppose  each  processor  is  associated  with 

4  mat-r  ^  elements 

y  ’ 


r  Lu>',  "A  t'langula'-  processor  array  for  computing  the 
s  ^gular  value  decomposition,"  TR84-625 ,  Dept  of  Computer 
Jc  ence,  Cornell  Un i v . ,  July  1984 
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Triangularization  phase: 

diagonal  node  processor  (calculation  of  rotation 
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A  computation  tree  carr  be  developed  for  the  computation 
porf-rmed  in  each  process'or,  wr,  icn  shows  the  data  dependency 
Detwee"  operations  and  prov’des  an  easy  way  to  estimate  the  total 
on-,  ;  -,o  delay  o'f  the  comouta^  i  o.n  A  ompu  ta  t  ;  on  tree  the 

co-^o^tat  .  ons  of  the  d  >  agona  ;  .  c  at,  .  n,  [.^ase  fo^  a  diagonal 
Processor  node  is  shown  in  F  icjure  11 
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In  each  processor,  one  or, -  ine  arithmetic  unit  is  needed  for 
each  arithmetic  operation.  This  is  because  when  performing  on¬ 
line  arithmetic,  the  arithmetic  operations  are  highly  overlapped 
and  hence  the  hardware  cannot  be  shared.  Counting  up  the  total 
number  of  operations  performea  in  each  processor  for  each  phase, 
we  have  the  number  of  on- i ine  arithmetic  units  needed  in  each 
processor  for  each  phase  as  shown  in  Table  6. 


phase 

1  d i agona 1  node 

non-diagonal  node 

1 

1 

1  15 

12 

2 

1 

1  36 

1 

36 

Table  6.  Number  of  arithmetic  units  per  processor 


In  floating-point  on-line  arithmetic  operations,  there  are 
two  kinds  of  delays.  One  is  the  delay  introduced  by  the 
algorithms  themselves  which  is  called  the  on-line  delay  and 
denoted  by  5,  and  is  defined  as  the  number  of  operand  digits 
neeaed  for  the  algorithm  to  produce  the  first  output  digit.  This 
delay  is  affected  by  the  degree  of  redundancy  and  the  radix  of 
the  number  representation  system  used,  and  is  usual  ly  a  sma I  I 
integer  between  1  and  4.  The  second  kind  of  delay  is  that  due  to 
normal izatlon  of  the  operands  and  results,  and  is  variable.  For 
our  SVD  algorithm,  an  estimate  for  the  lower  bound  of  the  delay 
of  the  computation  would  be  that  caused  by  the  on- I ine  delay  of 
each  operation.  According  to  [13] ,  this  is 
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^r^ere  L  is  the  number  of  levels  of  the  computation  tree,  6.  is 
tre  largest  on-line  delay  of  the  level,  n  is  the  number  of 

c I g I ts  to  be  calculated  for  the  result,  and  t^  is  the  digit-step 
time  which  is  the  time  needed  to  load  the  input  digits  and 
compute  one  digit  of  the  result. 

33  Processor  organization 

Tne  architecture  of  the  on-line  arithmetic  processor  is  based 

13 

on  that  given  by  Gor j  i -S i nak i  and  Ercegovac .  Here  we  give  only 
a  brief  description. 

Each  processor  contains  storage  for  4  matrix  elements,  a 
gioba'  control  unit  (GCU),  and  a  number  of  on-line  arithmetic 
-h  ,  t,s  (OLU)  . 

Each  OLU  consists  of  an  exponent  unit  (EU)  and  a  number 
(aepenaing  upon  the  precision  requirement)  of  identical 
process i ng  el ements  (PEs) .  Each  PE  is  a  d i g i t-s lice  on- I i ne 
a'"’thmetic  unit.  The  structure  of  an  on- I  i  ne  division  unit  is 
Shown  ;n  Figure  12,  and  details  of  the  design  are  given  in  [13]. 
It  s  mentioned  therein  that  the  proposed  organization  is  capable 
of  pe'"'f^orming  add/subtract  and  multiply  with  minor  modifications 
and  no  increase  in  hardware.  It  can  readi ly  be  shown  that  an  on¬ 
line  division  unit  can  also  perform  the  on-line  square  root 
algorithm.  In  Figure  12  the  number  of  PEs  is  equal  to  the 
D'^ec'Sion  required  and  the  X  and  Y  buses  are  used  for 


13  A  Go  r  j  i-Sinaki  and  M.  D.  E  '•c  ego  vac  ,  "Design  of  a  digit-slice 
on-line  arithmetic  unit,”  Proceedings  of  5th  Symposium  on 
Computer  Arithmetic,  May  1981,  pp  72-80. 
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correction  f actors  applied  to  tne  square  root  and  division 
operations.  Also,  and  e  are  the  exponents 


Z-BUS  (TO  NEXT  ON-LINE  UNTT) 


Figure  12  Organ i zat  >  on  of  an  on- line  d i v i s i on  unit. 

For  each  of  the  data  elements  transmitted  between  processors, 
eacr  processor  needs  to  have  two  sets  of  interconnection  I ines; 
one  "'or  input  and  the  other  for  output.  This  is  because  the 
connections  cannot  be  shared  due  to  the  computation  overlap  in 
or-l i ne  arithmetic  Each  set  of  connections  for  a  single  data 
element  includes  I i nes  for  the  exponent,  for  one  digit  of 
mant,.ssa,  and  possibly  some  control  I  I  ne  for  signal  i  ng  the 
arrival  of  da ta . 

3 . 4  Remarks 

Fixed-point  vs  floating-point  arithmetic : 

In  fixed-point  on-line  arithmetic,  there  are  certain 
'■estr  1  ct  I  ons  on  the  domain  of  the  operands  to  guarantee  that  the 
error  of  the  result  is  w.thin  certain  range.  Hence  if  fixed- 
point  arithmetic  is  used,  the  operands  wi  I  I  have  to  be  converted 
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to  within  the  required  domain  before  each  arithmetic  operation 
starts  and  later  the  result  wi  I  I  have  to  be  converted  back  to  get 
the  correct  result.  The  range  requirement  may  be  different  for 
different  arithmetic  operations.  Thus,  fixed-point  on-line 
arithmetic  induces  substantial  overhead  and  is  not  suitable  for 
complex  algorithms.  In  floating-point  arithmetic,  on  the  other 
hand,  the  algorithms  always  produce  normalized  results,  assuming 
the  operands  are  normal i zed .  Hence  no  extra  work  is  needed  when 
a  number  of  on- I i ne  operations  are  cascaded. 

Computation  of  iterative  algorithms : 

When  using  on-line  arithmetic  to  implement  an  iterative 
algorithm,  the  usual  way  to  handle  the  Iteration  is  to  replicate 
the  hardware  for  the  number  of  iterations  desired.  However,  this 
is  net  always  necessary  since  usually  only  finite  precision  is 
rea-  red  for  the  computation.  In  the  case  of  the  d i agona I i za t i on 
p''ase  of  our  SVD  algorithm,  for  example,  the  total  on- I  i  ne  delay 
o"^  the  operations  performed  on  a  diagonal  processor  node  is  34 
a ! g ' t-steps  .  Each  iteration  includes  operations  in  the  odd- 
numbered  rows  of  processors  and  then  in  the  even-numbered  rows. 
Hence  the  time  between  the  start  of  consecutive  iterations  is  68 
Oig  t-steps.  If  the  required  number  of  digits  of  the  result  is 
less  than,  say,  60,  then  no  repi ication  of  the  processor  array  is 
needed.  Likewise,  doub I  i ng  the  array  wi  I  I  al  low  up  to  about  130 
pig  ts  of  prec'S'On. 
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The  d'git-step  delay  ■. 


The  digit-step  time,  which  is  the  time  needed  to  compute  a 
Single  digit  of  the  result  of  an  arithmetic  operation,  depends  on 
the  time  needed  for  exponent  calculation,  for  execution  of  the 
recursion  formula,  for  loading  the  Input  digits,  and  for 
performing  the  output  digit  selection  process.  The  exponent 
calculation  involves  only  fixed-point  addition/subtraction,  and 
we  assume  that  this  requires  less  time  than  that  required  by  the 
mantissa  digit  computations  So  the  digit-step  time  is  basically 
determined  by  the  time  needed  for  the  basic  recursion  formula  and 
the  digit  selection  process  An  analysis  of  the  digit-step  time 
s  given  in  [13],  and  we  have  the  foil ow ing  formula 


Tstep  =  [8I<  Tcei  I  inglog^C^  +  1)  +  24]  6^ 


whe'"e  the  radix  is  r  =  2  and  6^  is  the  gate  delay  For  k  =  2  we 
note  that  tne  time  step  wi  I  I  be  greater  than  40  times  the  gate 
de : ay  This  value  is  based  on  the  time  required  to  perform  a 
c a r -y -p ropaga te  addition  across  some  I i m i ted  number  of  bits, 
wh'cr  depends  on  the  selection  rules  and  the  precision 
Typica' 'y,  it  might  Involve  8  -  9  bits  for  division.  It  iS 
important  to  minimize  the  number  of  delays  compared  to  that 
associated  with  ful I  carry  propagation  in  conventional 
arithmetic  For  this  reason  we  feel  that  less  than  10  gate 
delays  would  be  a  minimally  all  owed  value,  and  therefore  more 
worK  rs  required  in  this  regard  We  note  also  that  the  number  of 
gate  decays  per  time  step  is  a  function  of  the  arithmetic 
operation  Hence,  the  time  step  associated  with  the  slowest 
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ODe^atiC’^  A  aete^’^  '-e  -'e  i^me  step,  '■esulting  in 

‘^e'^'ficie^t  ^sage  c'f  na'-OAd^e 

On-iit'-e  arit-hmet  c,  cznnare;^  t-:  conventional  arithmetic,  has 
tne  advantages  reduced  r-inn^'-.cat.on  requirements  and  highly 

mod^  ar  ana  >jr:*crm  impieme''ta*'  or  ^4^  These  advantages  make 
or-;  ne  aritrmet  c  rignly  taole  '^or  LSI/VLSI  implementation. 
Tne  processor  gzn  \  zbX.  ^  or.  mentioned  ^n  this  paper  al  lows 
rep  I  icating  the  basic  PEs  to  ^orm  the  OLU ,  each  capable  of 
oe^^orming  some  arithmetic  operation,  and  then  putting  together  a 
number  OLUs ,  along  with  otner  components  such  as  GCU ,  storage 
devices,  etc  ,  tc  •form  a  processor.  This  design  is  quite 
f  ex  1 b i e  and  it  is  straightforward  to  design  processors  for  other 
computations 

Vcre  study  is  needed  for  problems  such  as  the  accumulative 
er-o'  oenavior,  on-i  i  ne  algorithms  for  compound  operations,  and 
wavs  to  nandle  the  variable  delay  caused  by  both  normal ization 
a  d  data  romin  q  thi-ougn  paths  o*  different  length  in  the 
:  ,  ~  c  _  t  a  t  i  c  tree 

L.  a '  '  2  E  -  r  t  Ope  rat  I o  n  s 

f:'  a  s'de  float-ng  point  on-line  arithmetic  operation  it 
s  rot  necessary  to  worry  about  the  problem  of  normal ization  of 
operations  fie  ,  for  subtraction)  because  the  on- I i ne  unit  can 
deal  with  circumstances  wfiere  cancellation  occurs  in  the  most 
gr  cant,  t,  >  '  r.  Hr,*eve'.  foir  multiple  on-line  units,  as  for 
e/ampie  iri  f*"  "VD  aigC'r  thrr.  ^  tfiis  delay,  which  is  variable, 
't^odu'es  a  s  c e  d  j  i  p  n  g  problem  T  t"' I  s  must  be  handled  with  some 
app-'cr.  r  ate  "-a^d-sriahirig  technique,  data  buffering  or  data-driven 
type  ope'-at'on,  wnich  incurs  e»tra  hardware  overhead. 
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Compound  Operations 


There  are  two  basic  approaches  to  implementation  of  an 
algorithm  by  on- I i ne  techniques  The  straightforward  approach, 
on  which  we  will  base  our  analysis,  uses  a  separate  on-line  unit 
for  each  arithmetic  operation  (add,  multiply,  etc.),  each  with 
Its  own  set  of  selection  rules.  An  alternative  approach  which  we 
rave  not  pursued,  is  to  build  an  on- line  unit  for  an  entire 
compound  unit  with  its  own  selection  rules.  The 
advartages/d i sadvantages  of  this  approach  remain  to  be 
aeterm i ned  Howe  ve  r,  there  are  poss ibilities  that  properly 
crcsen  primitives  could  provide  ways  of  avoiding  scaling  problems 
fc"  * i xed  point  operations  and  synchronization  problems  for 
f ! oao  ng  point  operations. 
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4  0  Algorithms  for  Square  Root  and  for  Tangent  to  Cosine 
Con  ve  rs i on 

4.1  Introduction 

A  step  in  the  conversion  of  a  coefficient  matrix  to  upper 
triangular  form,  by  the  method  of  Givens  rotations,  involves 
comput I nc  the  sine  and  cosine  of  a  vector  angle  from  the  values 
of  its  X  and  y  coordinates.  The  obvious  approach  evaluates  the 
relations 


cos  6 


,2)1/2 


and  sin  & 


1 _ 

2x1/2 

y  ) 


Or  a  fixed  point  computer,  loss  of  accuracy  can  result  from  the 
squaring  of  x  or  y  in  cases  where  these  values  are  sma I  I  .  Also 
the  conventional  algorithm  for  extracting  the  square  root  is  time 
consuming  and  becomes  a  bottleneck  in  the  speed  of  processing. 

Tn I s  report  d i scusses  a  number  of  a  I gor 1 thms  wh i ch  si mp I  i f y 
the  square  root  process,  and  other  algor  i thms  which  avoid  it 
altogether  in  the  computation  of  sine  and  cosine. 


42  Square  root  by  polynomial  approximation  and  iterative 

correction 

Approximating  the  function  over  the  entire  range  of 

possible  arguments  is  quite  impractical .  The  problem  can  be 
s i mp I  1 f i ed  by  recogn I z I ng  that  the  argument  need  on  1 y  range 
oetween  0  25  and  1.00,  since  any  positive  number  can  be  put  in 
tcis  range  by  repeated  shifts  of  two  binary  places  each  After 
tre  square  root  is  found,  it  can  then  be  restored  to  its  correct 
range  by  an  equal  number  of  shifts  of  one  binary  place  each. 
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Pc  vooT  a  approximations  to  the  function 


R  =  yFT  0  25  <N<100 

car  De  found  by  I ibrary  algor. thms  which  compute  the  coefficients 
of  the  least-squares  best  fitting  po lynomial  when  given  a  set  of 
points  lying  on  the  desired  curve  Starting  with  straight  line, 
paraboi ic,  or  cubic  approximations,  the  accuracy  increases 
roughly  by  a  factor  of  4  for  each  increase  in  the  order  of  the 
poi yrom  a  I  up  to  the  point  where  the  computation  deteriorates  due 
ho  error  In  genera  ,  the  square  root  cannot  be 

ca  ated  to  the  ful I  accuracy  of  a  given  computer  by  using  a 

r  I  gr  crqer  poynomial  app  rox  i  r,a  1 1  on  .  A  better  approach  is  to  use 
a  owe"  oraer  polynomial  to  provide  a  "first  guess"  input  to  a 
Newpor-Raphson  iteration  formula 


w  r  e  I"  e 

N  IS  the  given  number. 

R^  IS  the  first  approximation  to  the  desired  root. 

R^  IS  the  seconq  approximation  to  the  desired  root 

The  iteration  can  be  applied  several  times  Each  application 
wi ! I  double  the  number  of  accurate  places  in  the  result  unti I  the 
I  m  ■  t  set  by  the  compute''  register  size  is  reached  A  tradeoff 
must  be  made  between  the  order  of  the  polynomial  and  the  number 
of  I te  ra t  ions  required  txO  bring  t/he  results  Lo  ^uM  3ccuracy- 
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Fu  <  accuracy  or  a  32-bit  comput-er  can  be  had  from  a  straighb 
'ine  approximation  followed  by  3  iterations,  or  f  rom  a  pa  rabo I  i c 
approx  mation  fol lowed  by  2  iterations.  Considerations  involved 
in  making  a  choice  will  be  discussed  in  a  later  section. 

4.3  Tne  reciprocal  square  root 

The  algorithms  of  Section  4.2  can  be  used  wherever  the  square 
root  of  a  number  is  required.  In  the  case  where  the  result  Is  to 
De  useb  as  a  divisor,  such  as  in  the  formula 

®  =  7^ - ^2-T72 

(x  my) 

it  m  gnt  be  advantageous  to  use  an  approximation  to  the  function 

R  =  -1 

v/N 

foi  owed  by  iterations  using  the  formula 


R,  =  r3-N*R  *R)*R/2 

1  '  o  O"^  o 


It  s  '^Ound  that  attaining  32-bit  accuracy  requires  using  a  4th 
orde*"  polynomial  fol  lowed  by  two  iterations.  The  storage 
required  for  two  additional  coefficients  and  the  two  additional 
multiplications  required  In  each  step  more  than  offset  the 
advantage  ga i ned  by  el i m i nat i ng  one  division 
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44  Aigor  tnm  "pythag" 

14 

Mo  I  e  r  and  Morrison  present,  an  a  I  gor  i  thir'  which  computes 
2  2  1/2 

f  X  *  y  )  '  directly  ■from  x  and  y  without  taking  a  square  root. 

Initial  t  2  e : 

P=max  (|x|,  |y|) 

Q=min  (|x|,  lyj) 

I te  rate : 

9  2  9 

S  =  Q^/(4P  +  Q^) 

P  =  Pt2*Sx.p 

Q  -  S  .  Q 

After  3  iterations,  P  will  contain  the  desired 
result  to  the  ful I  accuracy  of  a  computer  with 
less  than  60  bits 

The  authors  state  that  Pythag  is  potential ly  faster  than 
Newton-Raphson  iteration  for  the  square  root  because  it  is 
Cub . ca I ly  convergent  compared  with  quadratic  convergence  for 
Newoon-Raphson  iteration  This  might  be  true  if  the  result  were 
wanted  to  hundreds  of  places,  but  in  real  situations  tne 
aovantage  of  Pythag  is  lost  because  of  the  greater  amount  of 
computation  required  per  step  of  iteration. 

45  Tangent  to  cosine  conversion 

The  use  of  Givens  rotations  for  matrix  triangulan  zation  does 

2  2  1  /2 

^ct  nioqij  i  re  the  eva  I  uat  ‘  on  of  (x  *  y  )  '  '  f  some  more  d  i  rect 

14  "Replacing  Square  Roots  by  Py  th  ago  rear  Sums,"  Cleve  Mo  I e  r  & 
Donald  Morrison,  IBM  J  Res  Develop  -  Vo  I  27,  No  6, 

Nov  83 ,  p  577 . 
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car  De  found  oo  compute  s  ■  ne  and  cosine  from  the  value  of  tne 
tangent . 

We  have  found  that  this  can  Pe  done  by  a  variation  of  the 
method  described  in  Section  4  2  The  method  involves  the  use  of 
a  polynomial  approximation  po  the  function 


cos  9  =  f  ("tan  9 ) 


*  o  i  owed  by  one  or  two  steps  of  iteration  using  the  formulae 


U  =  +  (C  T)^ 

O  '  o 


w  e  e  . 

T  =  the  given  value  of  tan  6 

=  the  first  approximation  of  cos  0 
=  the  se  :ond  approximation  of  cos  9 
The  value  of  sin  6  s  obtained  from 

s I n  0  =  COS  6  tan  9 

I ne  practical i ty  of  this  method  depends  on  the  observation 
that  the  tangent  need  only  range  from  0  to  1  in  the  polynomial 
apc  rox  I  mat  I  on  of  cos  ('arctan  6).  If  y  is  greater  than  *,  these 
arguments  need  only  be  interchanged  and  the  routine  wi i I  compute 
Sin  9  from  cotan  9.  The  cosine  is  then  found  from 

cos  6  =  sin  9  cotan  9 
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e  ■■  c  o  t  1  n  e  i  s  s  cim  e  w  n  a 

t  C  C  m  p  i 

cated  by  the 

"■eau 

r 

e  m  e  n  t  to 

save  the  signs  of  x  and  y 

and  to 

1 nd  the  abso ' 

u  te 

va 

1 ues  of  X 

and  y  for  use  in  the  remai 

n d e  r  of 

the  computat' 

on 

A 

compa  r 1  son 

t,hese  absolute  values  then  aet.ermines  if  an  interchange  is 
needed.  A  flag  must  be  set  to  -ndicate  that  the  interchange  must 
De  undone  at  the  end  of  the  computation.  The  oniginal  signs  of  a 
a'^d  V  are  then  applied  directly  to  cos  6  and  sin  9  respectively. 


^  9 

4  b  lan  to  cosine  conversion 

It  rav  be  possible  to  design  chip  hardware  which  will 
determ  ''e  the  greater  of  *  and  y  in  the  absolute  sense  without 
*  "st  computng  absolute  values  of  x  and  y.  If  this  is  possible, 
some  tme  will  be  saved  in  the  execution  of  the  routine.  A 
co'’seduen;e  is  that  a  s  '  gn  wi  i  !  now  be  attached  to  the  computed 
va  ue  c*  tan  i?  The  routine  can  be  made  independent  of  this  sign 
c_y  js  '0  a  polynomial  approximation  of 


o 

cos  6  =  .  tarT  9 ) 


■7- 

1 

^  tv  e  ^  a 

tior  steps  w  M  not  red u. re 

mod 

i  f  i 

cat  1 

Q  n  s 

n  0  e  tan  9 

4^  " 

^  e  ^  s  a  b 

the  sduare  iri  the  formu  a 

The  s 

1  gns 

of 

>  a  r.  d  y  m  u  s 

s  ^ 

C  *5 

saved  for  r  e  s  t  o  a  t  i  o  !■ .  r  o  c  o  s 

9 

and 

s  1  n 

9  . 

As  before, 

tn 

e  a  Q6 

2 

of  tar  ^  9  need  only  be  * rom 

0  to 

1 

4.7  Summary  of  Previous  Algorithms 

1.  Polynomial  coefficients 
a.  Square  root 

First  degree  polynomial 
=  11/16  N  +  11/32 

—8 

With  2  iterations,  error  =  6  x  10  rms 

- 14 

With  3  iterations,  error  ~  10  rms 

Second  degree  polynomial 

R  =  -0.316394414  +  1.052146819  N  +  0.259248366 

o 

With  2  iterations,  error  =  6  x  10  rms 

2.  Reciprocal  square  root 

Third  degree  polynomial 

R  =  -2,439288044  +  6.232249739  -  5.912738903  N  + 

o 

3. 112778043 

-9 

With  2  iterations,  error  =  6  x  10  rms 

3.  Tangent  to  cosine  conversion 

Sixth  degree  polynomial 

C  =  0.114580644  T®  -  0.453009177  T^  +  0  593659040  t"^  - 

O 

0.054354432  T^  -  0.493471808  T^  -  0.000296313  T  i-  1.000001911 
With  no  iterations,  error  =  1 . 597E-6  rms 
With  1  iteration,  error  =  6.105E-12  rms 
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2 

4  Tan  to  cosine  conversion 

Fifth  degree  polynomial 

C  =  -0.031263642  +  0,128727929  -  0.254304918  + 

o 

0  362943041  -  0.498991485  T  +  0  999983578 

With  no  iterations,  error  =  7  678E-6  rms 
With  1  iteration,  error  =  1.353E-10  rms 

4.8  Estimated  run  times 

In  this  section  each  algorithm  is  I i sted  as  a  series  of 
steps,  starting  in  every  case  with  the  values  of  x  and  y  and 
ena i ng  with  the  output  of  sine  and  cosine  The  estimated  number 
of  high  speed  (arithmetic)  clock  cycles  for  each  step  Is 
tabulated  and  the  total  elapsed  time  is  given.  A  summary  of 
results  if  given  here: 


algorithm 

1  a  Square  root  with  1st  degree  polynomial 

P  Square  root  with  2. id  degree  polynomial 

c  The  above  with  2  ALUs 

2  Reciprocal  square  root  with  4th  degree  polynomial 
A  I gor i thm  PYTHAG 

3  Tan  to  cos  with  6th  degree  polynomial 

2 

4  Tan  to  cos  with  5th  degree  polynomial 


CLOCK 

CYCLES 

83 

77 

60 

101 

167 

103 

101 


It  'S  evident  that  lb  and  Ic  provide  the  shortest  run  times.  If 
less  of  accuracy  due  to  squaring  of  x  and  y  is  found  to  be 
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significant,  then  algorithms  3  and  4  can  be  considered.  An 
example  of  the  actual  operations  performed  for  lb  is  shown  in 
Figure  13. 


I oad  X  1 

square  x  5 

store  x^  1 

I oad  y  1 

square  y  5 

add  x^  1 

shift  to  range  2 

store  result  (N)  1 

store  shift  count  1 

mu  I t  by  a  5 

add  b  1 

mu  I t  by  N  5 

add  C  1 

store  result  (R  )  1 

I oad  N  2 

d i V i de  by  R  16 

o 

2  t i mes  add  R  2 

o 

shiftlplace  2 

store  result  (R^)  1 

sh i ft  to  range  2 

store  result  (x^  +  =  0  1 

load  X  1 

d i V i de  by  D  8 

store  result  (cos)  1 

I oad  y  1 

d i vide  by  D  8 

store  result  (sin)  1 

77 

F  gure  13  Steps  required  to  perform  square  root  (method  lb) 
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4.9  "Best*'  algorithm  for  computing  (x 


4- 


HYPOT" 


This  can  be  written  as; 


=  x(l 


2  1/2 

nJ 


or  as 


=  x(l 


On  a  fixed  point  computer,  the  first  form  results  in  loss  of 
accuracy  if  x  and  y  are  sma I  I  numbers.  For  example,  on  a 
computer  with  30  bits  to  the  right  of  the  binary  point,  numbers 
which  occupy  only  the  last  15  bits  wi  I  I  become  zero  when  squared. 
The  second  form  avoids  this  error  by  performing  the  division 
before  the  squaring.  If  only  one  number  is  sma I  I  ,  accuracy  wl  I  I 
be  improved  if  this  number  is  made  the  numerator  of  the  fraction. 
This  IS  done  in  the  first  steps  of  the  algorithm  by  taking  the 
apsO'Ute  values  of  x  and  y,  comparing  them,  and  performing  an 
excnange  if  required.  The  algorithm  as  described  thus  far  can  be 
^sea  with  a  conventional  square-root  procedure,  and  constitutes  a 
vai ‘d  methoo  for  avoiding  the  sauaring  error, 

Tne  running  time  of  the  algorithm  can  be  shortened  by 
el 'm  nating  the  conventional  sauare  root  and  finding  the  result 
by  an  iterative  procedure.  Given  the  expression: 


r 


(1 


1/2  2 

we  w:  I  I  compute  ( 1  +  U)  '  where  U  =  (y/><)  The  argument  U  wi  I  I 

range  only  from  0  to  1  if  the  comparison  and  exchange  of  x  and  y 

have  been  performed  as  described  above  The  function 

1/2 

f^U;  =  (1  +  U)  '  can  be  computed  by  Newton-Raphson  iteration: 
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where  =  a  first  guess, 

f  =  the  improved  value. 

The  iteration  can  be  repeated  any  number  of  times,  and  at 
each  step  wi  I  I  double  the  number  of  accurate  bits.  The  most 
efficient  algorithm  results  If  the  iteration  Is  performed  only 
once  In  our  example  this  requires  a  first  guess  accurate  to 

15  bits  to  give  a  final  result  good  to  30  bits. 

1  /2 

The  function  (1  +  U)  '  is  a  smooth  function  which  can  be 
we i i  represented  by  a  polynomial  approximation  of  relatively  few 
terms  The  coefficients  of  the  polynomial  are  computed  by  the 
least-squares  procedure.  This  computation  is  done  only  once  and 
is  not  part  of  the  algorithm.  The  resulting  coefficients  are 
stc'ea  in  permanent  memory  for  use  in  the  algorithm.  The  4th 
order  polynomial  is  adequate  to  give  17-bit  accuracy.  The 
polynomial  coefficients  are 

rms  error : 

ORDER  2 

-0.070215334983 
0  482062943591 

1  001323600003  4.701  E-4 

ORDER  3 

0.024014900953 
-0 . 106237686413 
0 . 496400319759 

1.000158637158  5  125  E-5 

ORDER  4 

-0.010281085549 
0.044577072051 
-0  ■ 19412897544 
0. 499294445341 

1.000020416187  6  240  E-6 
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ORDER  5 


r'ms  error 


ORDER  6 


ORDER  7 


ORDER  8 


0.004933240713 
-0  022614187331 
0.055513244505 
-0  123484054442 
0  499864663374 
1 . 000002729927 

-0.002537302797 
0.012545149104 
-0.031247590762 
0  060093537381 
-0 . 124615400309 
0.499974514201 
1  000000373098 


0.001367498511 
-0.007323547587 
0 . 019161948440 
-0.035823977127 
0.061743633664 
-0  124907280763 
0.499995274613 
1.000000051632 


-0.000762269076 
0.004416574816 
-0.012297231345 
0.023411232648 
-0.037853839273 
0  062278473297 
-0  124978526529 
0  499999135213 
1  000000007192 


8  124  E-7 


1 . 106  E-7 


1 . 556  E-8 


2  242  E-9 
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ana  t-he  algo'^ithm  steps  are  (4tri  degree  polynomial) 


I  oad  X 
abs  va I ue 
I  oad  y 
abs  value 
compa  re 

exchange  (if  required) 
save  abs  x  (or  abs  y) 
Divide 


square 

save  resu I t  (U) 
mu  1 1  by  a 
add  b 
mult  by  U 
add  c 
mult  by  U 
add  d 


mult  by  U 
add  e 

save  resu  I  t  (f /^) 

I oad  U  (sh i f tea) 

add  0.5 

d i V i de  by 

add  fp  (shifted) 

mult  by  abs  x  (or  abs  y) 


Algorithm  HYPOT  computes  (x^  +  y^) with  30-bit  accuracy. 

An  essential  feature  is  the  division  of  y  by  x  before  squaring, 
whicn  avoids  the  loss  of  accuracy  associated  with  the  squaring  of 
sma i I  numbers.  The  possible  interchange  of  x  and  y  at  the 
beginn  ng  of  the  algorithm  does  not  require  the  setting  of  a 
logical  flag,  since  the  operation  does  not  need  to  be  "undone"  at 
the  end  of  the  algorithm.  There  is  no  need  to  scale  and  unscale 
numbers  except  for  two  simple  one-place  shifts  which  do  not  need 
to  be  undone . 

Algorithm  HYPOT  may  be  compared  with  algorithm  PYTHAG  [14] 
which  s  advocated  by  two  authors  at  IBM  Both  accomplish  the 
same  result  but  PYTHAG  requires  147  clock  cycles  compared  with  65 
eye ' es  for  HYPOT  This  corresponds  to  13  multiply  times,  only 
i  X  times  slower  than  if  one  had  a  hardwired  square-root  circuit 
as  fast  as  a  multiplier. 
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APPENDIX  A 


DIVISION  SCHEMES  WITH  SIMPLIFIED  SELECTICN'  i 
AND  PREDICTION  OF  QUOTIENT  DIGITS 
Milos  D.  Ercegovac 
August  3,  1983 


Report  No.l 


1.  Introduction 

In  a  previous  report#  a  paper  presented  at  the  6th  IEEE  Sym¬ 
posium  on  Computer  Arithmetic  [ERCE833 ,  a  general  division  scheme 
was  presented,  based  on  a  divisor/dividend  transformation  tech¬ 
nique  such  that  the  selection  of  the  quotient  digits  can  be  per¬ 
formed  by  simple  rounding. 

In  this  report  we  elaborate  on  the  implementation  and  per¬ 
formance  aspects  of  a  radix-4  variant.  Of  particular  interest  is 
the  fact  tnat  the  next  quotient  digit  can  be  obtained  in  parallel 
with  the  next  remainder  computation. 

The  discussion  and  results  discussed  here  are  preliminary 
ana  require  further  refinement. 

2.  Divisor  and  Dividend  Transformation 

v.'e  follow  closely  the  results  from  [ERCE83J  in  this  deriva¬ 
tion. 
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SCUD0  SO 


The  scaled  remainders  for  the  transformation  are  defined  as 
“k  ' 

where  Xq  =  X.  We  want  that  IXp  -  II  ^1/64  or,  equivalep^ly ,  that 
^  1/64.  Assuming  that  iDpj  ^1,  p*'4. 


The  expressions  for  the  transformation  are: 


D.  «  X-  -  1  a 

2Xq  -  1 

if  Xq  <  0.75 

1  "1  ^Xq  >  1 

That  is,  Sg  4  {0,1}. 

Equivalently, 

otherwise 

r'  * ' 

if 

=  1 

°2  *  f ‘>1 

if 

=  0 

1 

(N 

Q 

If  s,  = 

-1 

D,  =  4Dt  +  St  +  S-Dt/4 
j  I  2.  ^2 
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The  transformed  divisor  is 

X*  =  X.  =  D.4"^  +  1 
4  4 

The  initial  dividend  is  transformed  using  the  following  recur¬ 
sion: 


^k+l  =  k.0,1,2,3 

e£  and  S3 

The  selection  intervals  are  determined  by  evaluating 
“k  - 

for  =  dmax/dmin  and  all  values  of  Sj^  *  -2, -1,0, 1,2.  Assum¬ 
ing  -0.99  <  <  0.99  we  Obtain  the  following  intervals: 


dmin  =  -0.99,  dmax  =  0.99 

Selection  Intervals  for  k=  3 


s  = 

-2, 

dmin  = 

0.2606452, 

dmax 

* 

0.7716129, 

delta  = 

0.7716129 

s  = 

-1, 

dm.in  = 

0.0025397, 

dmax 

= 

0.5053968, 

delta  = 

0.2447517 

s  = 

0, 

dmin  = 

-0.2475000, 

dmax 

s 

0.2475000, 

delta  = 

0.2449603 

s  = 

1, 

dmin  = 

-0.4898462, 

dmax 

= 

-0.0024615, 

delta  = 

0.2450385 

s  = 

2, 

dmin  = 

-0.7248485, 

dmax 

= 

-0.2448485 , 

delta  = 

0.2449977 

dm in=-0 .7248484848,  dmax=0 . 771 6 129032 

Selection  Intervals  for  k=  2 
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s  = 

-2, 

dm  in  = 

0.3643290, 

dmax  = 

0.7918894 , 

delta  = 

0.791«894 

s  = 

-1, 

dm  in  = 

0.0733737, 

dmax  = 

0.4724301, 

delta  = 

0.1081011 

s  = 

0, 

dmin  = 

-0.1812121 , 

dmax  = 

0.1929032, 

delta  = 

0.1195295 

s  = 

1, 

dmin  = 

-0.4058467 , 

dmax  = 

-0.0537381, 

delta  = 

0.1274740 

s  = 

2, 

dmin  = 

-0.6055219, 

dmax  = 

-0.2729749, 

delta  = 

0.1328718 

dmin=-0.6055218855,  dmax=0 .7918894009 

Selection  Intervals  for  k=  1 


s 

= 

-2, 

dmin  = 

0.6972391, 

dmax 

* 

1.3959447, 

delta 

= 

1-3959447 

s 

= 

-1, 

dmin  = 

0.1314927, 

dmax 

* 

0.5972965, 

delta 

-0.0999426 

s 

= 

0, 

dmin  = 

-0.1513805, 

dmax 

0.1979724, 

delta 

ax 

0.0664796 

s 

= 

1, 

dmin  = 

-0.3211044, 

dmax 

-0.0416221, 

delta 

s 

0.1097584 

s 

s 

2, 

dmin  * 

-0.4342536, 

dmax 

-0.2013518, 

delta 

0.1197526 

dmin*-0.4342536476,  dmax=l .3959447005 

The  overlap  is  indicated  by  "delta".  A  set  of  selection  rules  is 
given  next.  In  these  rules,  d  and  s  denote  the  corresponding 
ana  respectively. 

Select  S^ 

if  (d<=-0.1)  s  =  1; 

else  if  ( (d>-0.1) & (d<=0 .165) )  s  =  0; 

else  s  =  -1; 

Select  S2 

if  (d<=-0.33)  s  =  2; 
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else  if  ( (d>-0.33) & (d<=-0.1) )  s  =  1; 

else  if  ( (d>-0.1) & (d<=0.1) )  s  *  0; 

else  if  ( (d>0.1) & (d<=0.39) )  s  =  -1; 

else  s  =  -2; 

Select  $2 

2 
1 
0 
1 
2 


If  (d<=-0.36)  s  = 

else  if  ( (d>-0.36)&(d<=-0.12))  s  * 

else  if  ( (d>-0.12) & (d<=0.12) )  s  * 

else  if  ( (d>0.12) & (d<=0.36) )  s  »  - 

else  s  *  - 


3.  Mam  Recursion  with  Quotient  Digit  Prediction 

Once  the  divisor  and  the  dividend  are  transformed  into  the 
required  range,  we  apply  the  following  recursion  on  the  partial 
remainders. 

+  signKj^*l/2j 

^1+1  *  4(R.  -  q.X*) 

* 

where  Rq  =  Y  . 

A  direct  implementation  of  this  recursion  would  require 
three  substeps: 

(i)  Select  q^, 

* 

(ii)  Generate  q  and 

X  A 
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(iii)  Compute  + 

However,  it  is  possible  to  overlap  the  step  (i)  with  steps  (ii) 
ana  (iii).  Assume  that  is  known.  Then,  define  the  recursion 
as 


where 


Ri.l  -  4(R,  -  q,X*) 

“Si+I  ' 


(>1  i  qj 


otherwise 


Therefore,  the  recursion  step  contains  only  two  substeps  instead 
of  three: 


Compute  Ri  +  1  I - I 

Compute  qi+l  I - 1 

I  I 

Compute  qi+lX*I  I 


--y  I  < - I - >  I  < 


Step  i-l  Step  i  Step  i+l 
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The  overall  tia.ir.g  of  the  main  recurEion  would  looK  like 


I  R1  I  I  R2  1  R3  I  ...  I  Ri  !  Ri+1  I  ... 

!  ql  I  q2  I  q3  I  . . .  I  qi  I  qi+1  I  ... 


4.  A  Complete  Radix-4  Algorithm 


We  give  a  C  version  of  the  complete  radix-4  division: 


tdefire  M  16 

♦define  X  0.5 

♦define  Y  0.07401786542 

♦define  R  4 

♦define  K  1 

main  () 

{ 

double  xO,  yO ,  dl,  yl ,  d2,  y2,  d3 ,  y3,d4; 
double  quot,  power; 
float  r; 

double  ertf  xprime,  yprime,  rem,  remnext; 
int  i,  q,  qnext,  si,  s2 ,  s3,  m; 

xO=X;  yO  =Y;  ra=M;  r=R; 

/*  Step  0  */ 

if  (xO  <  0.75  ) 

{ 

dl  =  2.0*x0  -  1.0; 
yl  =  2.0*y0; 

} 

else 

{ 

dl  =  xO  -  1.0; 
yl  =  yO; 

} 

/  *  Step  1  */ 

si  =  selone (dl ) ; 
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d2  =  r*dl  +  si  +  sl*dl; 
y2  =  yl*  (1  +  sl/r  ) ; 

/*  Step  2  */ 

s2  =  seltwo (d2) ; 

d3  =  r*d2  +  s2  +  s2*d2/r  ; 
y3  =  y2* (1  +  s2  /  (r*r)  )  ; 

/*  Step  3  */ 

s3  =  seltre  (d3)  ; 

d4  =  r*d3  +  s3  +  s3*d3/(r*r); 

yprirae  =  y3*(l  +  s3  /  ({r*r)*r)); 

xprime  =  d4/((r*r)*r)  +  1; 

quot  =  0; 

power  =  1.0; 

rem  =  yprime; 

if  (rem  >  0.0  )  q  =  rem  +  0.5; 
else  q  =  rem  -  0.5; 

/*  Recursion  */ 

for  (i  =  1;  i  <  m+1  ;  ++i  ) 

{ 

remnext  =  r*(rem  -  xprime*q) ; 

qnext  ®  select (rem,  q,  xprime); 

quot  =  quot  +  q*power; 

err  =  yO/xO  -  quot; 

power  =  power/ r; 

q  =  qnext;  rem  =  remnext; 

} 

} 

/*  Select  si  */ 

selone  (d) 
douDle  d; 

{ 

int  s; 

if  (d  <=  -0.1  )  s  =  1 

else  if  ((  d  >  -0.1)  &  (d  <=  0,165  ))  s  =  0 

else  s  =  -1 

r  e  t  u  r  n  ^  s  )  ; 

/*  Select  s2  */ 
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seitwo  (d) 
double  d; 

{ 

int  s ; 


if  (  d  <=  -0.33  )  s  =  2 
else  if  ((  d  >  -0.33  )  &  (d  <=  -0.1  ))  s  =  1 
else  if  ((  d  >  -0.1  )  &  (d  <=  0.1  ))  s  =  0 
else  if  ((  d  >  0.1  )  &  (d  <=  0.39  ))  s  «  -1 
else  s  s=  -2 


return (s)  ; 

} 

/*  Select  s3  */ 

seltre  (d) 
double  d; 

{ 

int  s; 


if  ( 

d  < 

= 

-0 

.36  ) 

s  = 

2 

else 

if 

( ( 

d 

>  -0.36 

)  &  (d  <=  -0.12  )  ) 

s  = 

1 

else 

if 

( ( 

d 

>  -0.12 

)  h.  (d  <=  0.12  )) 

s  = 

0 

else 

if 

( ( 

d 

>  0.12  ) 

&  (d  <«  0.36  )) 

s  » 

-1 

else  s  •=  -2; 

return (s)  ; 

} 

/*  Select  */ 

select  (d,  q,  div) 
double  d,  div; 
int  q; 

{ 

int  s,  k; 

double  rtrunc,  dtrunc; 
k  =  K; 

/*  Remainaer  truncated  to  6  bits;  divisor  replaced  by  1  */ 

s  =  d  *  64.0;  rtrunc  =  s;  rtrunc  =  rtrunc  /  64.0; 
s  =  div  *  64.0;  dtrunc  =  s;  dtrunc  =  dtrunc  /  64.0; 
dtrunc  =  1.0; 

rtrunc  =  (  rtrunc  -  q  *  dtrunc  >*  4.0; 

if  (rtrunc  >0)  f  s  =  rtrunc  0.5;} 
else  s  =  rtrunc  -  0.5; 

return (s ) ; 
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} 
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5.  Example 


dl 

_ 

xO 

=  0.5000000000 

0.0000000000,  yl 

si 

* 

0 

d2 

s 

0.0000000000,  y2 

s2 

= 

0 

d3 

s 

0.0000000000,  y3 

s3 

= 

0 

d4 

0.0000000000 

xprime  “ 

1.0000000000  , 

i  Remainder  q 

predicted  next  q  =  1 
1  0.1480357308  0 


predicted  next  q  =  -2 

2  0.5921429234  1 

predicted  next  q  »  2 

3  -1.6314283066  -2 


predicted  next  q  =  -2 

4  1.4742867738  2 

predicted  next  q  =  0 

5  -2.1028529050  -2 

predicted  next  q  ®  -2 

6  -0.4114116198  0 

predicted  next  q  =  1 

7  -1.6456464794  -2 

predicted  next  q  =  2 

8  1.4174140826  1 

predicted  next  q  *  -1 

9  1.6696563302  2 

predicted  next  q  ®  -1 

10  -1.3213746790  -1 

predicted  next  q  =  -1 

11  -1.2854987162  -1 

predicted  next  q  =  -1 


0.0740178654,  Q  = 
0.1480357308 

0.1480357308 

0.1480357308 


yprime  =  0 

Quotient" 

.1480357308,  ql 

Error 

0.0000000000 

0.1480357308 

0.2500000000 

-0.1019642692 

0.1250000000 

0.0230357308 

0.1562500000 

-0.0082142692 

0.1484375000 

-0.0004017692 

0.1484375000 

-0.0004017692 

0,1479492188 

0.0000865121 

0.1480102539 

0.0000254769 

0.1480407715 

-0.0000050406 

0.1480369568 

-0.0000012259 

0.1480360031 

-0.0000002723 

0.148035/308 


0 
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.2  -1.1419948646 

-1 

0.1480357647 

-0.0000000339 

predicted  next  q  = 

2 

13  -0.5679794585 

-1 

0.1480357051 

0.0000000258 

predictea  next  q  = 

-1 

14  1.7280821658 

2 

0.1480357349 

-0.0000000041 

predictea  next  q  = 

0 

15  -1.0876713367 

-1 

0.1480357312 

-0.0000000003 

predictea  next  q  = 

-1 

16  -0.3506853469 

0 

0.1480357312 

-0.0000000003 

6.  Binary-level  Implementation 

{to  be  done  } 

7.  Pertormance  Analysis 

(to  be  done} 

8.  Alternatives 

For  transformation  part: 

-  Have  a  small  table  of  reciprocals  of  the  truncated  divisor, 
perhaps  to  4-6  bits;  use  three  stages  of  CSAs  to  multiply 
the  divisor  (2  bits  per  stage  of  the  reciprocal) ;  propagate 
carries  to  get  the  transformed  divisor;  repeat  for  the  dividend 
but  do  not  propagate  carries. 

-  Use  radix-2  in  the  transformation  part;  possibly  much  simpler 
implementation. 
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-  Use  radix-16  in  the  transformation  part  -  details 
worked  out  on  the  binary  level;  possibly  fewer  steps. 

For  recursion  part; 

-  Implement  two  steps  in  one  clock  period;  double 

the  combinational  logic  (  CSAs,  selection  and  multiple  generator) . 
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APPENDIX  B 


RADIX-4  DIVISION  WITH  RANGE  TRANSFORMATION 


M.  Eroegovac  and  T.  Lang 
Augmt,  1984  (Modified  August  22) 


The  division  algorithm  described  has  the  following  characteristics: 

•  Is  of  the  typical  recurrence  type  with  a  redundant  quotient  representation 
with  digit  set  txtw'cen  -p  and  p. 

•  Simplifies  the  quotient  selection  by  restricting  the  range  of  the  divisor  to  be 
between  1-a  and  1+a. 

•  Improves  the  speed  of  execution  by  predicting  the  quotient  digit. 

The  execution  consists  of  two  phases  (Figure  1): 

(1)  Transformation  of  the  divisor  X  into  the  range  (1-a)  s  X  s  (1+a)  and 
adjustment  of  the  dividend. 

(2)  Recurrence  to  obtain  the  quotient. 

In  the  next  section  we  describe  the  algorithm  and  determine  the  value  of  a  re¬ 
quired  to  allow  the  quotient  selection  to  be  done  by  rounding.  By  reducing  a  further  it 
is  possible  to  perform  the  roxmding  on  a  limit^  precision  estimate  of  the  partial 
remainder.  Fm^y  we  consider  the  possibility  of  predicting  the  quotient  digit.  For  relat¬ 
ed  references  see  [ERCE83J. 


The  Recurrence  Step  and  the  Value  of  a 

This  part  of  the  algorithm  consists  in  computing  the  sequence  of  partial 
remainders 

R{.+  11  =  r(R[il  -,p[)  (1) 


where 

X  is  the  divisor, 

^[0]  =  K  is  the  dividend, 

7,  is  a  digit  of  the  quotient  Q  -  qQ.qiq2  '  '  <im  with  -p  rs  s  p. 
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The  quotient  digit  q^  is  selected  so  that 


|R[<I-«,I  =  3  (2) 

with  3  a  constant  to  be  determined. 

Since  the  quotient  digit  is  in  the  range  -p  to  p,  this  selection  implies  that 

(/?(0|  s  p+3 

We  now  determine  the  restriction  in  the  range  of  the  divisor  for  which  this  bound 
on  the  partial  remainders  is  satisfied,  and  show  that  in  this  case  the  computation  of  the 
quotient  is  correct. 

Since 

=  r(/?[i]  -  q^) 


we  can  write 

/?[«>!]  =  r(/?[i]-^,)  +  r{\-X)qi 

and  since  ^  3,  (l-o)  s  X  s  (1+a),  and  :S  p  we  get 

[/?[/+ IJI  s  r3  +  rap 

Consequently,  for  |/?[<  +  l]l  s  p+3  it  is  sufficient  that 

r3  +  rap  S  p  +  3 


which  results  in 

a  ^  (l/r)[l  -  3(r-l)/p]  (3) 

Now  we  show  that  this  value  of  a  results  in  a  correct  quotient.  The  algorithm 
computes  the  correct  quotient  if 


uiLh 


\(Y/X)  -  Q\<  r-"* 


e  =  2  q,r-* 

i~0 
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and 


By  expansion  of  the  recurrence  we  get 

Y  =  X2  qr'*  +  + 

<-0 


Therefore, 


(Y/X)  -  C?  = 

\(Y/X)  -  Q\s  r""-^ 


(/?[m+ll/X) 


Consequently,  the  quotient  is  correct  if 

|*U  • 

Introducing  the  bounds  on  J?[/n+ 1]  and  on  X,  we  get 

(p+0)/(l-a)  s  r 
which  is  satisfied  for  the  bound  on  a  obtained  before. 
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Choice  of  3  Precision  of  Remainder  Estimate 

Since  is  an  integer,  it  is  necessary  (from  (2))  that  For  this  value  of  ^  a 

full  precision  partial  remainder  is  requir^  for  the  selecticm  of  9/.  A  larger  value  of  3 
permits  the  use  of  an  estimate  of  reduced  precision.  Let  R[i]  be  an  estimate  of  R[i]  to 
be  used  for  the  determination  of  the  quotient  digit.  Assume  that 

R[t]  s  ^[i]  i  ie[i]  +  & 

Then  to  assure  that  |R[/]-9j|  2  3  it  is  sufficient  that 

^  P 

and 

l«[/]  +  8  -  9,1  :s  3 


Again,  since  9,  is  an  integer  the  smallest  bound  on  |R[<]-9,|  is  1/2,  obtained  by 
using  rounding  on  R,  resulting  in 


|8  +  y2|  s  3 

and  therefore 

8  s  3  -  1/2 

Consequently,  to  reduce  the  precision  of  1?  it  is  convenient  to  increase  3*  On  the 
other  hand,  increasing  3  reduces  a  and  therefore  requires  more  preadjusting  steps.  El¬ 
iminating  3  from  the  expression  (3)  for  a  we  get 

a  s  (l/r)[l  -  (8  +  l/2)(r-l)/p]  (4) 

In  summary,  in  order  to  have  -  9J  ^3  it  necesary  to  prcprocess  X  into 
the  range  (l-a)iXs(l+a)  and  to  use  an  estimate  /?[ij  with  precision  8  to  compute  r, 
according  to  the  function 

(round(R[i])  If  integer{\R[i]\)  <  p 
\integer{R[i\)  If  integer {\R[i]l)  =  p 
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Quotient  Digit  Prediction 


The  division  process  outlined  consists  of  a  sequence  of  iterations.  Each  of  these 
iterations  is  formed  of  three  steps  (see  Figure  2): 

•  Determination  of  the  remainder  estimate  /{[t]  in  form 

•  Determination  of  a  quotient  digit  (rounding) 

•  Selection  of  a  divisor  multiple  g^X 

•  Subtraction  to  obtain  new  partial  remainder  /?[/+!]  in  carry-save  form 
The  time  of  an  iteration  step  is 

T  —  +  f,  +  /g,  +  q 

where 

tg  =  time  for  assimilation  of  i?[i]  ‘ 

=  time  to  rotmd 

tj  =  time  to  select  the  ^visor  multiple 
=  time  of  subtraction  in  carry-save  form 
ti  =  time  to  load  the  registers 

To  reduce  the  time  of  an  iteration  step  it  is  possible  to  precompute  the  quotient 
digit  in  the  previotis  iteration  step.  This  results  in  an  iteration  step  consisting  of  two 
parallel  paths.  In  one  the  next  partial  remainder  is  obtained  while  in  the  other  the  next 
quotient  digit  is  computed  (Flgtire  3a). 
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Using  the  quotient  calculation  procedure  presented  before,  the  quotient  digit 
depends  on  /?[/+ 1].  In  order  to  predict  this  digit  it  is  necessary  to  base  the  prediction  on 
/;[/]  (and  maybe  X)  since  /?[/+!]  has  not  been  computed  yet.  Since 

J«[i+1)  =  r(«W-9**) 
it  is  possible  to  determine  qi+i  by 

=  round(R[l+l])  =  round{r{R[i\- q^)) 
which  could  be  approximated  by 

qi+l  =  round(assim(r(R[i]-q^)) 

(where  round(a)  is  p  if  azp.) 

This  prediction  does  not  produce  a  significant  reduction  in  time  since  the  path  re¬ 
quires  the  same  steps  as  the  iterative  step  without  prediction;  selection  of  the  multiple, 
subtraction,  assimilation,  and  rounding  (Figure  3b). 

A  more  promising  approach  is  to  introduce  an  additional  approximation  and  com¬ 
pute  the  digit  quo*  ent  as 

q,+i  =  round(r(Rli]-q,)) 

This  eliminates  the  step  of  selecting  the  multiple  of  the  remainder  and  simplifies 
the  subtraction,  since  q^  is  an  integer  (Figure  4a).  The  time  of  a  step  is  now 

T  =  max{t^  + 

where  now  includes  the  subtraction  of  q,  and  the  rounding. 
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If  the  path  for  calculating  is  longer  than  that  to  compute  /?[i-f  IJ,  it  is  possi¬ 
ble  to  balance  the  two  paths  by  including  the  assimilation  in  the  second  path  (Figure  4b) 
and  store  /{[i].  In  addition,  to  reduce  the  critical  path,  it  is  possible  to  use  faster  circuts 
in  the  slice  required  to  compute  R  (even  duplicating  this  slice  to  reduce  the  complexity  of 
the  interconnection  might  be  convenient).  In  this  case  the  time  is 

T  =  max{t^  +  +  t^s  ^  ti) 

Since  this  procedure  of  quotient  digit  prediction  introduces  an  additional  approxi¬ 
mation  (using  qi  instead  of  qpC)  it  produces  the  correct  quotient  if  the  divisor  range  is 
further  restricted,  that  is  an  additional  limitation  on  a  is  introduced.  We  now  determine 
this  restriction. 

The  basic  recurrence  for  i  + 1  can  be  written  ^  - 

«[/+21  =  r(/i[,  +  ll  - 
Replacing  R[i  + 1]  in  tenns  of  R[i]  we  get 

^[i  +  2]  =  r(r(i?[jJ  -  qPi)  -  9,+iX) 

This  can  be  transformed  into 

R[i^2]  =  r(r(/?[/]  -  q,)  -  -K  r(l-X)<?,  +  (l-^)9.>l) 

Since  the  prediction  is  done  so  that 

I '’(/?[']  -  <7,)  -  I  s  3 
and  |l-Afl:Sa,  we  obtain 

[/?[i  +  2]]  s;  r(3  +  rap  +  ap) 
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Consequently,  for  ll?[j+2lsp+3,  we  need 

P  +  3  a  r(3  +  (r  +  l)pa) 

which  results  in 

-  (-l)f)  (5) 

which  is  l/(r  + 1)  times  the  value  without  prediction. 


Prediction  of  the  Quotient  Digit  and  Approximation  of  the  Remainder  Estimate 


To  reduce  the  critical  path  in  Figure  4b  it  is  possible  to  compute  an  approziinatioo 
5[<  +  l]  of  instead  of  the  exact  value  (of  course  this  would  require  a  further 

reduction  of  the  range  of  X  to  get  a  correct  quotient).  A  suitable  expression  for 
is 

S[i+1]  = 

The  calculation  of  5[i+l]  is  simpler  than  that  of  /?[/+’!]  because  it  does  not  re¬ 
quire  selection  of  the  multiple  and  because  the  subtraction  of  9^  is  simpler  than  the  sub¬ 
traction  of  since  q^  is  an  integer.  The  resulting  time  is  (Rgure  5) 

T  ~  max(r^  ^es 


The  restriction  on  the  range  of  X  is  now 


This  value  of  a  is  small  and  would  require  many  transformation  steps  for  the 
divisor. 
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Range  Transformatioo  of  the  Divisor 

The  previous  algorithm  requires  the  divisor  to  be  transformed  into  the  range 
(1-a)  ^  (l+o). 

This  transformation  can  be  done  by  the  following  recurrence: 

with  1/2  s  Jf[0]=X  <  1  and  1-2"^  <  X[pJ=X‘  <  1+2“^. 

The  selection  of  Jj+i  is  done  by 

'  1  if  Aro[i]=0  and  X,+i[»l=0 
=  -1  if 

0  otherwise 

Since  “  Taorc  convenient  to  define 

z[i]  =  2W1-1) 

and  perform  the  equivalent  recurrence 

z[i  +  lj  =  2z[/J  +  J/+i^[0] 
with  r[0]=X-l  and  X‘  =  X{p]  =  2-Pz[p]  +  1. 

Now  the  selection  is 
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1  If  ri[il=l  and  Z2[*]~0 
j,+l  =  -1  If  ri[«l=0  and  z:ii]=l 
0  otherwise 

We  now  determine  the  selection  intervals  and  show  that,  since  the  intervals  over* 
lap,  it  is  possible  to  use  a  z  with  limited  precision  for  the  selection. 

Since  l-2“*  <  X[*J  <  1+2“*,  the  range  of  r[ib]  is 

-1  <  r[ik]  <1 

The  selection  intervals  are  determined  by  solving 

z[i]  -  1/2(2[/+11  -  r,+iX[0]) 

for  z[i  +  ll  =  ±1  and  j,+i  =-1,0,1. 

Consequently 

=  1  If  z[/l  €  (-(i+;c[0])/2,  (1-j:[0])/2) 
j,>i  =  0  If  z[/l€  (-l/2,l/2) 

=  -1  if  z[i]  €  ((-l+X[0])/2,  (l+X[0])/2)) 

Since  1/2  ^  Al[0]  <  1,  we  obtain  the  irter/als  of  Figure  6a.  Consequently,  the  fol¬ 
lowing  selection  rule  results 

1  If  z[ils-l/4 

=  0  L"  -y4<z[i]<l/4 

-1  z[i]2:1/4 

This  results  in  an  overlap  of  8  =  1/4  so  that  it  is  necessary  to  assimilate  over  posi¬ 
tions  0,  1,  2,  and  3. 


B-ll 


Hie  dividend  has  to  be  adjusted  in  accordance  to  the  transformation  of  X.  That 
is, 

TnctfaH  of  adjusting  the  dividend,  it  is  possible  to  adjust  the  quotient.  That  is, 
compute  Q*  ==  Y/X“  and  obtain  the  true  quotient  as 

Q  =  Q\1  + 

The  transformation  process  consists  of  the  following  steps  (Figure  6b); 

(1)  Compute  z[0J  =  X[0]-1. 

For  i=0,...,p  do 

(2)  Determine  from  an  estimate  of  z[»]. 

(3)  Select  the  corresponding  multiples  of  X  and  Y.  4- 

(4)  Compute  r[/+l]  and  F[i+lJ. 


(5)  Compute  X*  =  z\p\l~P  +  1 


Ra<iix-4  Implementatioiu  (with  TTL  Timings) 


We  now  discuss  the  time  and  complexity  for  several  radix-4  implementa¬ 
tions.  To  compare  them  with  the  TTL  design  presented  by  Taylor  [TAYL81]  we 
give  timings  using  Taylor’s  delay  estimates. 

We  choose  the  following  parameters: 

r=4y  p=2,  6=1/8.  • 

This  results  in 

3=5/8  and  a  =  1/64  (without  prediction)  and  a =1/320  (with  prediction). 

The  choice  of  p  limits  the  divisor  multiples  required  to  -2, -1,0, 1,2. 

The  choice  of  8  requires  a  (assimilated)  remainder  estimate  of  7  bits  (3  for  the  in¬ 
teger  part  and  4  for  the  fraction  since  a  truncation  of  the  carry-save  partial  remainder 
after  the  i-th  bit  produces  an  error  of  2x2'^.  A  better  possibility  is  to  truncate  after 
the  6th  bit  and  add  a  carry  to  this  6th  bit.  Thu  produces  an  error  of  ±2~^. 

If  the  resulting  a  is  too  small,  especially  with  prediction,  it  is  possible  to  increase 
6  to  1/16  resulting  in  3=9/16  and  a  =  1/128  (with  prediction).  The  increase  in  8  results  in 
a  remainder  estimate  of  8  bits  (or  7  if  the  ad(  ion  of  the  carry  is  done). 

A)  No  prediction:  carry-save  remainder  and  carry-propagate  estimate 
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In  this  design  we  implement  the  partial  remainder  in  carry-save  form  and  com¬ 
pute  the  estimate  using  a  carry-prc^gate  adder  (CPA)  of  four  bits  and  a  two-level  net¬ 
work  to  compute  the  carry  into  the  4-bit  slice  (only  4  bits  of  the  estimate  arc  required 
for  the  rounding).  The  selection  of  the  quotient  digit  is  performed  by  rounding  the  esti¬ 
mate  (Figure  7).  The  time  estimate  for  TTL  is: 

-  determination  of  estimate  (4-bit  CPA  and  two-level  network)  30  ns. 

-  rounding  (two  gate  levels)  12  ns. 

-  select  multiple  19  ns. 

-  carry  :  ave  subtraction  12  ns. 

-  set  register  5  ns. 

TOTAL  78  ns. 

4  - 

This  time  can  be  reduced  to  70  ns.  if  the  selection  of  the  multiple  is  done  by  vec¬ 
tor  AND  gates  and  a  decoded  quotient  (11  ns.  instead  of  19  ns.) 
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B)  Slice-save  partial  remainder  and  estimate 


In  this  design  the  partial  remainder  is  cxnnputed  using  slices  of  2  bits  with  t  e  car- 
lies  between  slices  saved  (Figure  8)  (this  2*bit  slice  is  compatible  with  the  radix-4  design, 
a  4-bit  slice  does  not  seem  possible). 

The  estimate  is  computed  from  the  9  bits  corresponding  to  the  three  most  signifi¬ 
cant  slices  (only  6  bits  of  the  remainder  have  to  be  assimilated  in  this  case  since  there  is 
just  one  bit  in  the  7th  position  (Figure  8b)).  The  estimate  can  be  computed  using  a  4-bit 
CPA  and  a  S-input  AND  gate  (Figure  8b). 

The  TTl.  time  is  now: 

-  determination  of  estimate  (4-bit  CPA  and  one  gate)  24  ns. 

-  rounding  (two-level  network)  12  ns. 

-  select  multiple  19  ns. 

-  subtract  (2-bit  slice)  12  ns. 

-  set  register  5  nsec. 

TOTAL  72  ns. 

Again  the  time  can  be  reduced  to  66  ns.  by  reducing  the  time  for  selection. 

C)  PLA  for  quotient  generation 

In  this  design  the  remainder  is  computed  in  carry-save  (1-bit  or  2-bit  slices)  form 
and  the  6  most  significant  bits  (12  b:  s  or  9  bits)  arc  used  directly  for  the  quotient  gen¬ 
eration  (with  a  PLA)  (Figure  9).  The  PLA  probably  has  many  AND  terms  since  addition 
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is  involved  in  the  function. 


The  TTL  time  is: 

•  computing  qi  (rounding)  (PLA)  2S  ns. 

-  select  multiple  19  ns. 

.  subtract  (carry-save)  12  ns. 

•  set  register  5  ns. 

» 

TOTAL  61  ns. 

Reducing  the  time  of  selection  we  get  in  this  case  53  ns. 

D)  Second-level  prediction 

As  trsntioned  before  this  prediction  uses  * 

qi^i  =  round  (assim  (r(R[/]-g,))) 

For  the  values  of  th>ese  designs  the  reduction  of  a  is  from  1/64  to  1/320. 

In  this  case  there  are  two  concurrent  paths:  the  computation  of  ^,>1  and  that  of 
/?[<>!].  As  mentioned  in  section  x,  the  computation  of  the  estimate  can  be  included  in 
ary  of  the  two;  the  choice  being  made  in  such  a  way  that  the  critical  path  is  reduced. 
We  consider  both  possibilities. 


Scheme  I:  Estimate  calculation  in  path  (Figure  10). 


The  two  paths  are  as  follows: 


i)  Estimate  and  calculation  of  qi^i. 

This  path  includes  the  computation  of  the  estimate,  the  subtraction  of  9,  and  the 
rounding.  The  computation  of  the  estimate  can  be  done  as  in  cases  A)  or  B)  (Figures  7 
and  8). 

The  subtraction  of  and  the  rounding  is  done  as  follows:  Let  us  call 

t 

P  =  Since  we  subtract  and  then  multiply  by  4  (and  qi  is 

an  integer),  the  sign  of  P  is  obtained  by  subtracting  Qq  from  Rq,  that  is, 

P-2  = 

The  value  of  P  is  obtained  by  shifting  P[i],  that  is. 

^  • 

The  quotient  digit  is  obtained  by  rounding  P  if  it  is  smaller  than  2  and  by  the  in¬ 
teger  pan  of  P  if  it  is  equal  or  larger  than  2.  This  results  in  the  following  table: 
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From  the  table  the  following  expressions  result: 

s  * 

Q-Ji+lJ  =  F-2(/ei'  +  Ri  +  Ri) 

=  F_2Ri'  +  RiRi  +  R\Ri  +  R\R-^-!> 

Go[/+l]  =  ir-i  +  Ri)ir-1  +  Ri){Ri  +  «‘3)(^‘2'  +  Ri) 

Substitutmg  F_2  in  these  expressicns  it  is  clear  that  the  subtraction  and  rounding 
can  be  performed  in  two  gate  levels. 

The  timing  for  this  scheme  is 

-  determination  of  the  estimate  (like  in  A  or  B)  30  ns.  or  24  ns. 

-  subtraction  of  q^  and  rounding  (2  levels)  12  ns. 

•  set  register  S  ns. 

TOTAL  47  ns.  or  41  ns. 
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As  in  C)  the  complete  calculation  of  the  quotient  could  be  done  using  a  PLA  with 
12  inputs  (or  9  inputs  if  2-bit  slices  are  used)  resulting  in  a  total  time  of  25+5=30  ns. 
As  comented  there,  since  addition  is  involved  the  PLA  might  have  many  AND  terms. 

ii)  Calculation  of  the  partial  remainder 

-  selection  of  the  multiple  19  ns. 

-  subtraction  (1-bit  or  2-bit  slice)  12  ns. 

-  set  register  5  ns. 

TOTAL  36  ns. 

Again  the  selection  can  be  reduced  to  11  ns  resulting  in  a  total  of  28  ns. 

The  longest  path  is  therefore  30  ns  if  the  PLA  is  used  and  41  ns  if  it  is  not  used. 

4  ' 

If  the  PLA  cannot  be  used,  the  critical  path  can  be  reduced  by  using  faster  cir- 
caits  in  the  calculation  of  the  estimate  and  in  the  rounding.  For  example  using  FAST 
circuits  would  produce  a  dela;;  of... 

Scheme  II:  Estimate  calculation  as  part  of  partial  remainder  path 

From  scheme  I  it  can  be  seen  that  moving  the  calculation  of  the  estimate  to  the 
partial  remainder  path  would  ler.  jihcn  it  excessively  (at  least  for  the  TTL  timings  being 
considered). 

What  could  be  done  is  to  move  part  of  the  estimate  calculation.  An  attractive 
possibility  would  be  to  includ .  here  the  calculation  of  the  carry  into  the  4-bit  slice  (one 
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gate)  and  to  compute  also  the  p’s  and  g’s  of  the  4-bit  slice  (Figure  12).  This  would  add 
one  level  to  the  partial  remainder  path  and  reduce  one  level  from  the  quotient  path.  The 
result  would  be  that  both  paths  would  be  approximately  of  35  ns.  Of  course,  different 
balances  can  be  achieved  if  some  of  the  circuits  are  of  a  faster  technology. 
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Riinge  Tramfcnnotion  cf  the  Divisor 
M.  Erccgovac  and  T.  Lang 
December  20,  1984 

The  division  algorithm  discussed  in  the  previous  reports  [ERLA84]  requires  that 
the  divisor  be  transformed  into  the  range  (1-a)  sx  s  (l+o)-  In  [ERCE83]  a 
transformation  based  on  the  continued  product  normalization  algorithm  [ERCE72, 
ERCE73]  is  given.  The  implementation  of  this  algorithm  corresponds  to  a  recurrence 
which  is  difficult  to  speed-up  sufficiently  with  the  available  hardware  complexity.  We 
also  considered  several  radix-2  iterative  algorithms  for  the  divisor  transformation.  The 
onc-bit-per-step  alternative  has  an  undesirably  long  step  time  while  that  with  the  predic¬ 
tion,  on  the  other  hand,  requires  very  complex  selection  rules.  All  of  the  above  men¬ 
tioned  approaches  are  characterized  by  a  sequential  generation  of  digits  used  in  the  divi¬ 
sor  transformation  thus  precluding  any  overlap  between  the  steps.  As  an  alternative 
that  has  more  potential  for  a  faster  transformation,  we  consider  here  an  approach  based 
on  a  reciprocal  approximation  by  power  series.  The  method  consists  of  two  parts: 

(i)  Compute  Af ,  an  approximation  to  the  reciprocal  of  the  divisor  X,  such  that 

jM  -  iyX|  s  a/X 

(ii)  Multiply  the  divisor  by  M  to  obtain  X*,  such  that 

X*  =  XAf  and  [X*  -  1]  s  a 


1.  Power  Series  Approximation 

Let  R  =  1/D  where  1  ^  D  <  2  is  2X.  That  is, 

D  =  (l_r  2X3... 0:^4+ 

and 

R  =  (0. 1x2^3  ■  •  •  rj 


To  compute  R,  we  decompose  D  such  that  D  =  Xj  +  2“*X2  where 

X  -  (I.J3  •  1  s  Xj  <  2  -2"* 

A':  -  0  O2  <  1  -2-'''-** 


] 


By  McLauriu’s  scries  expansion  we  have 


D  X,  +  2'% 
1  1 

1  +  2-%-^ 

■*1 

^ - 

1  +  2~^X^i 


=  /?i[l  -  2'*X2/?i  +  -  .  ■  .  ] 


For  the  approximation  of  R  we  use  the  first  two  terms  of  the  expansion  and  trun* 
cate  the  result  to  t  bits.  We  get 

R  =  Rl-  2-^RfX2  - 

where 

is  /?!  truncated  to  u  bits, 
is  (^i)^  truncated  to  s  bits, 

Xi  ViXi  truncated  to  v  bits,  and 
0  s  e,  <  2~'  is  the  truncation  error. 

We  now  compute  a  boimd  for  the  approximation  error  e  such  that 

R  =  R  +  e 


This  error  can  be  written  as 


where 


e  =  ct  ^  ep  ^ 


-  ej-  is  the  error  due  to  the  use  of  only  two  terms  of  the  scries, 

-  ep  is  the  error  due  to  the  truncation  of  /?i,  and 

-  ej  is  the  error  due  to  the  tnmeation  of  r}  and  of  X2. 


We  have  (*) 


R  =  (^i+2"“)  -  2"*(^f+2~0(X2+2"'')  +  er 

which  results  in 

J?  =  ^  +  2"'  +  2"“  -  +  tf 


Consequently,  since  s  1-2“'  andJif2  s  1-2"'', 

—  (2"^*'*'''^  +  2“^*‘'’'^  i  e  SI  2~‘  +  2"“  +  2"^ 

The  choice  of  k,  t,  s,  and  v  should  be  *n«A»  to  that 

\e\  2  a/2  since  hi  -  2R 


Let 


a  =  2-P 


then 

2"'  +  2"“  +  2~^  s  2"^"^ 


and 

s  2“^"^ 


Several  choices  for  k,  t,  u,  and  v  are  possible  to  satisfy  these  conditions.  The 
selection  should  simplify  the  implementation.  For  example  for  a  =  2~^,  the  following 
are  possible  choices: 

4=4,  /=9,  4=9,  j=4,  and  v=4 
4=5,  t=9,  4  =  8,  J=3,  and  v=3 
4=6,  /=8,  4  =  9,  s=2,  and  k=2 


For  a  =  2"^: 


4  =  5,  r=9,  4  =  10,  j=4,  v  =  4 
4  =  6,  t=9,  4  =  10,  j  =  3,  v  =  3 


(*)  To  simplify  the  notatioD  some  of  the  errors  are  denoted  by  their  marltruim  values. 
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2.  Reciprocal  Generation  for  a  =  2“^  (p  =  6) 

Wc  now  consider  a  detailed  design  for  p  ==  6.  We  choose  k=5,  f=9,  u=8,  j=3, 
and  v=3. 

Table  1  displays  the  8-bit  truncated  reciprocal  and  its  3-bit  truncated  square. 


Table  1 


r2 

W0W1W2W3 

1.00000 

0.11111111 

0.111 

1.00001 

0.11111000 

0.111 

1.00010 

0.11110000 

0.111 

1.00011 

0.11101010 

0.110 

1.00100 

0.11100011 

0.110 

1.00101 

0.11011101 

0.101 

1.00110 

0.11010111 

0.101 

1.00111 

0.11010010 

0.101 

1.01000 

0.11001100 

0.100 

1.01001 

0.11000111 

0.100 

1.01010 

0.11000011 

0.100 

1.01011 

0.10111110 

0.100 

1.01100 

0.10111010 

0.011 

1.01101 

0.10110110 

0.011 

1.01110 

0.10110010 

0.011 

1.01111 

0.10101110 

0.011 

1.10000 

0.10101010 

0.011 

1.10001 

0.10100111 

0.011 

1.10010 

0.10100011 

0.011 

1.10011 

O.IOIOOOOO 

0.011 

1.10100 

0.10011101 

0.010 

1.10101 

0.10011010 

0.010 

1.10110 

0.10010111 

0.010 

1.10111 

0.10010100 

0.010 

1.11000 

0.10010010 

0.010 

1.11001 

0.10001111 

0.010 

1.11010 

0.10001101 

0.010 

1.11011 

0.10001010 

0.010 

1.11100 

0.10001000 

0.010 

1.11101 

0.10000110 

0.010 

i. 11110 

0.10000100 

0.010 

1.11111 

0.10000010 

0.010 
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Tliesc  functions  can  be  implemented  using  a  5-input,  10-output  PLA-  If  this  is 
not  leasable,  the  table  can  be  decomposed  into  subtables. 

As  indicated,  the  reciprocal  R  is  approximated  by 

A  A  _ _ i|A^A 

R=Ri-  2  *RtX2 

To  perform  the  subtraction  we  could  complement  the  subtrahend  and  add.  How¬ 
ever,  this  would  require  a  range  extension  of  this  subtrahend,  which  complicates  the  im¬ 
plementation.  It  seems  better  to  complement  the  first  term  (it  can  be  obtained  in  this 
form  from  the  PLA),  add,  and  then  complement  the  result.  The  use  of  ones*  comple¬ 
ment  seems  better  since  it  avoids  the  addition  of  1  in  the  complementation  of  the  result 
(note  that  no  end-arotmd-carry  is  produced  during  the  addition  since  zf  =  0).  The  trail¬ 
ing  I’s  produced  by  the  complementation  of  can  be  avoided  by  incrementing  a  unit  in 
the  last  significant  position  of  the  complement  of  Ri 

The  configuration  of  the  addition  is  shown  in  Figure  1.  The  multiplication  of  Ri 
by  ^2  implemented  by  the  addition  of  three  partial  products. 

Since  the  reciprocal  approximation  will  be  used  to  multiply  the  divisor  in  order  to 
obtain  X*,  and  a  radix-4  multiplication  is  to  be  used,  it  is  necessary  to  recode  the  re¬ 
ciprocal  approximation  to  a  radix-4  representation  with  digit  set  {-2,-l,0,l,2}.  Due  to  the 
fact  that  the  most  significant  bits  of  the  adder  have  only  one  operand  different  from 
zero,  it  is  possible  to  perform  the  recoding  on  the  two  nx»t  significant  radiz-4  digits 
without  waiting  for  the  carry  propagation  from  the  other  digits.  This  is  convenient  since 
we  are  going  to  perform  the  multiplication  beginning  with  the  most  significant  digit  of 
the  reciprocal  approximation;  that  is,  the  multiplication  can  begin  before  finishing  the 
addition. 

To  simplify  the  recoding  the  six  most  significant  bits  of  R^  are  computed  in  nor¬ 
mal  (not  complemented)  form.  Table  2  shows  the  resulting  Z,  obtained  by  complement¬ 
ing  adding  1  in  the  least  significant  position  and  complementing  again  positions  0  to 
5. 


Table  2 


^0*1*223*4^5^6'^7'^8' 

WoWiW2W3 

1.00000 

0.11111001 

0.111 

1.00001 

0.11110000 

0.111 

1.00010 

0.11100000 

0.111 

1.00011 

0.11101110 

0.110 

1.00100 

0.11100101 

0.110 

1.00101 

0.11011011 

0.101 

1.00110 

0.11010001 

0.101 

1.00111 

0.11010110 

0.101 

1.01000 

0.11001100 

0.100 

1.01001 

0.11000001 

0.100 

1.01010 

0.11000101 

0.100 

1.01011 

0.10111010 

0.100 

1.01100 

0.10111110 

0.011 

1.01101 

0.10110001 

0.011 

1.01110 

0.10110110 

0.011 

1.01111 

0.10101010 

0.011 

1.10000 

0.10101110 

0.011 

1.10001 

0.10100001 

0.011 

1.10010 

O.IOIOOIOI 

0.011 

1.10011 

0.10011000 

0.011 

1.10100 

o.rjoiioii 

0.010 

1.10101 

0.10011110 

0.010 

1.10110 

0.10010001 

0.010 

1.10111 

0  10010100 

0.010 

1.11000 

T 100101 10 

0.010 

1.11001 

0.10001001 

0.010 

1.11010 

0.10001011 

0.010 

1.11011 

0.10001110 

0.010 

1.11100 

0.10000000 

0.010 

1.11101 

0.10000010 

0.010 

1.11110 

0.10000100 

0.010 

1.11111 

0.10000110 

0.010 

In  order  to  do  the  recoding  of  the  first  two  digits  without  waiting  for  the  carry, 
the  third  digit  has  to  absorb  the  carry  cj  into  position  S,  as  indicated  in  the  figure.  Sinw 
this  carry  corresponds  to  the  oomplenient  of  the  result,  when  this  carry  is  1,  a  unit  has  to 
be  subtracted  from  the  third  digit.  To  absorb  this  subtraction,  we  ret^e  the  digit  to  the 
set  {-1,0, 1,2}.  The  recoding  is 
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M2 

000 

0 

001 

1 

010 

2 

on 

100 

-1 

101 

0 

no 

1 

111 

-2 

Consequently,  the  recoding  of  b: 


A#i 

000 

0 

010 

1 

100 

(l)(-2) 

no 

(1)(*1) 

001 

1 

on 

2 

101 

111 

_ m 

where  t  = 

Fmally,  the  recoding  for  Mq  is: 


^2 

Afo 

1 

2 

For  the  recoding  of  Af  3  and  we  use  the  corresponding  bits  of  the  result  of  the 
addition,  after  complementation.  This  results  in: 
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^6^ 7^8 

Af3 

000 

"T 

001 

1 

010 

1 

oil 

2 

100 

-2 

101 

-1 

110 

-1 

111 

0 

Using  a  sign-and-magnitude  representation  of  the  radix-4  digits  we  get  the  follow¬ 
ing  switching  expressions: 

A/qi  =  0  Wqj  =  Z2  A/(jo  =  ^2* 

Ml,  =  Z2  Mil  =  2213'#'  +  Z2Z-^  Miq  =  z^t  +  22r  +  z{z^* 

M2j  =  2423  +  24*23'  Af21  =  ■*“  ^425^5  ^20  “  *5'^5  +  ^5^s' 

Ml,  ==  Mil  ~  ^6^ 7*^8*  ■*■  ^6*^ 7^8  ^30  ~  ^7*^8  ^7^8^ 

^4,  =  A/4J  =  ^8^9'  ^40  “  ^9 

A  design  of  the  adder  and  the  recoding  circuit  is  shown  in  Figure  2. 


3.  Divisor  Transformation 

The  divisor  is  transformed  by  multiplying  it  by  M.  Since  the  mo>t  significant  digits 
of  Af  are  ready  first,  and  to  use  the  same  carry-save-adder  used  for  the  division,  we  per¬ 
form  the  multiplication  beginning  with  the  most  significant  digit  of  M.  This  type  of  mul¬ 
tiplication  usu^y  requires  an  adder  of  increasing  precision.  However,  in  t^  case  the 
most  significant  bits  of  X*  are  1.000000  or  0.111111,  so  that  keeping  a  few  extra  bits  is 
sufficient  to  determine  X*. 

To  reduce  the  number  of  multiplication  steps,  it  is  possible  to  load  the  registers 
with  M(PC  and  produce  MqX  +  Af^X  in  one  multiplication  step.  The  configuration  of 
the  multiplication  steps  is  shown  in  Figure  3. 

To  put  the  divisor  in  the  form  required  by  the  division,  it  is  necessary  to  convert 
it  from  the  carry-save  representation  to  a  conventional  representation.  This  step 
corresponds  to  a  carry-propagate  addition. 
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4.  Timing  of  the  Transformation 


The  time  of  the  transformation  is  determined  by  the  following  components: 

a)  Determination  of  and  by  the  FLA.  We  assume  2  gate  delays  for  this 

step. 

b)  Recoding  to  obtain  Mq  and  This  requires  two  gate  delays. 

c)  Multiplication  step  to  obtain  4A/oY  +  MiX.  This  corresponds  to  5  gate  delays. 

d)  Three  more  multiplication  steps  to  multiply  by  Af2>  ^3>  ^4*  assume 

that  the  addition  and  recoding  of  these  ^gits  is  overlapped  with  steps  b)  and  c).  This  re¬ 
quires  5x3  =  IS  gate  delays. 

e)  The  carry-propagate  addition.  We  assume  15-20  gate  delays. 

The  total  is  of  39-44  gate  delays.  This  corresponds  to  8-9  cycles. 
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